cassandra cql cql3

Cassandra - duplicate timestamps with TimeUUID?


I have sensors that write data to a log file at a high rate. I want to store these logs in Cassandra and process them with Spark.

I have thought about using a TimeUUID column for storing my timestamp to preserve the order automatically. My queries will rely heavily on range queries, so I thought this might be ideal. However, my logs can contain duplicate timestamps due to the frequency of the logging. The logs are not streamed to Cassandra; I am working with historical data only. The timestamp will be part of my compound primary key. I cannot think of a viable column that I could pull into the row key to make a row with a duplicate timestamp unique.

The documentation says: "The values returned by minTimeuuid and maxTimeuuid functions are not true UUIDs in that the values do not conform to the Time-Based UUID generation process specified by the RFC 4122. The results of these functions are deterministic, unlike the now function."

If I force the date of a TimeUUID with these functions instead of using now, two rows with the same timestamp get the same value, so I might end up overwriting previous data.

I will use Java/Scala to bulk-insert my historical data from .json to Cassandra. (Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4)


How can I store duplicate timestamps in my data?

  1. Do I use a TimeUUID(now) for my primary key and have the actual date/time stored in a different column? This would make me lose the benefits of having the actual date/time ordered already.
  2. Do I have to make sure that my Java/Scala application will generate valid, unique TimeUUIDs? If so, are there any common libs I can use?

Or are there other (better) options?

Thanks


Solution

  • Your idea to use timeuuids as a unique identifier is the proper approach. When properly done, you won't have duplicates. The timeuuid is a type 1 uuid, which contains not only a timestamp (with 100-nanosecond resolution) but also a clock sequence and a node identifier, so two values can differ even for the same point in time.

    So, now the question remains - how should you generate timeuuids for your historical data? As you noted, the minTimeuuid/maxTimeuuid functions aren't good for generating a proper version 1 uuid. That's ok, because that's not their purpose. You'll need them later on when you're querying your data using time ranges:

    SELECT * FROM sensor_readings
       WHERE sensor_id = 123
       AND ts > maxTimeuuid('2016-07-15 00:00+0000')
       AND ts < minTimeuuid('2016-07-17 00:00+0000')
    

    Unfortunately, CQL doesn't offer a function to generate them for a given timestamp (as of CQL 3.3), so your client must generate the uuid. There are a few Java libraries that will do it. See this question for some suggestions. Be sure to pick a quality library that guarantees uniqueness.
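
To make the client-side generation concrete, here is a minimal sketch that uses only the JDK and no driver or third-party library. The class name HistoricalTimeUuids and its methods are hypothetical, not part of any Cassandra API. It assumes log timestamps have millisecond resolution, that a single loader process handles the import, and that a random node id and clock sequence are acceptable stand-ins for a real MAC address. Duplicates are kept apart by counting how many readings have been seen for each millisecond and writing that count into the sub-millisecond (100 ns) digits of the version 1 timestamp.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper: builds version 1 (time-based) UUIDs for historical timestamps. */
public class HistoricalTimeUuids {
    // Number of 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;
    private static final int TICKS_PER_MILLI = 10_000;

    // Variant bits, clock sequence and node id; fixed for the lifetime of this generator.
    private final long clockSeqAndNode;
    // How many UUIDs have been issued so far for each millisecond.
    private final Map<Long, Integer> seenPerMilli = new HashMap<>();

    public HistoricalTimeUuids() {
        SecureRandom random = new SecureRandom();
        long clockSeq = random.nextInt(1 << 14);            // 14-bit clock sequence
        long node = (random.nextLong() & 0xFFFFFFFFFFFFL)   // random 48-bit node id
                | 0x010000000000L;                          // multicast bit: not a real MAC
        this.clockSeqAndNode = 0x8000000000000000L | (clockSeq << 48) | node;
    }

    /** Returns a UUID that is distinct on every call, even for a repeated epochMillis. */
    public synchronized UUID forTimestamp(long epochMillis) {
        int n = seenPerMilli.merge(epochMillis, 1, Integer::sum) - 1;
        if (n >= TICKS_PER_MILLI) {
            throw new IllegalStateException("more than 10,000 readings in millisecond " + epochMillis);
        }
        // 60-bit timestamp in 100 ns ticks; the duplicate counter fills the sub-millisecond part.
        long ts = epochMillis * TICKS_PER_MILLI + UUID_EPOCH_OFFSET + n;
        long msb = ((ts & 0xFFFFFFFFL) << 32)        // time_low
                | (((ts >>> 32) & 0xFFFFL) << 16)    // time_mid
                | 0x1000L                            // version 1
                | ((ts >>> 48) & 0x0FFFL);           // time_hi
        return new UUID(msb, clockSeqAndNode);
    }

    /** Recovers the original millisecond timestamp from a version 1 UUID. */
    public static long toEpochMillis(UUID uuid) {
        return Math.floorDiv(uuid.timestamp() - UUID_EPOCH_OFFSET, (long) TICKS_PER_MILLI);
    }
}
```

Create one generator per bulk load and call forTimestamp(millis) for every row; the result can be bound to the timeuuid column. Since the counter sits inside the timestamp itself, readings that share a millisecond should sort in the order they were inserted, and the minTimeuuid/maxTimeuuid range query above still brackets them. The per-millisecond map grows with the number of distinct timestamps, so if the input is already sorted by time, remembering only the last millisecond and its counter would be enough. With several loader processes, uniqueness rests on the random node ids differing, which is very likely but not guaranteed.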