I have sensors that write data to a log file at a high rate. I want to store these logs in Cassandra and process them with Spark.
I have thought about using a TimeUUID column to store my timestamp so that the order is preserved automatically. My queries will rely heavily on range queries, so I thought this might be ideal. However, my logs can contain duplicate timestamps because of the logging frequency. The logs are not streamed to Cassandra; I am working with historical data only. The timestamp will be part of my compound primary key. I cannot think of a viable column that I could pull into the row key to make a row with a duplicate timestamp unique.
The documentation says: "The values returned by minTimeuuid and maxTimeuuid functions are not true UUIDs in that the values do not conform to the Time-Based UUID generation process specified by the RFC 4122. The results of these functions are deterministic, unlike the now function."
If I force the date of a TimeUUID instead of using now, I might end up overwriting previous data.
I will use Java/Scala to bulk-insert my historical data from .json to Cassandra. (Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4)
How can I store rows with duplicate timestamps in my data?
Or are there other (better) options?
Thanks
Your idea to use timeuuids as a unique identifier is the proper approach. When properly done, you won't have duplicates. The timeuuid is a version 1 UUID, which contains not only a timestamp but also additional bits (a clock sequence and a node identifier) to guarantee uniqueness even for the same point in time.
So, now the question remains - how should you generate timeuuids for your historical data? As you noted, the minTimeuuid/maxTimeuuid functions aren't good for generating a proper version 1 uuid. That's ok, because that's not their purpose. You'll need them later on when you're querying your data using time ranges:
SELECT * FROM sensor_readings
WHERE sensor_id = 123
AND ts > maxTimeuuid('2016-07-15 00:00+0000')
AND ts < minTimeuuid('2016-07-17 00:00+0000')
Unfortunately, CQL doesn't offer a function to generate them for a given timestamp (as of CQL 3.3), so your client must generate the UUID. There are a few Java libraries that will do it. See this question for some suggestions. Be sure to pick a quality library that guarantees uniqueness.
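To make the idea concrete, here is a minimal sketch of client-side generation using only the JDK. It is not taken from any particular library: the class name HistoricalTimeUuids and its methods are hypothetical, and it assumes that each generator instance is fed timestamps in non-decreasing order (for example, one log file read from top to bottom). Rows that share a millisecond are told apart by the 100 ns sub-millisecond ticks of the version 1 timestamp, and a random clock sequence and node keep separate generator instances from colliding with high probability.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical helper: builds version 1 (time-based) UUIDs for given timestamps.
final class HistoricalTimeUuids {
    // 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;
    private static final int TICKS_PER_MILLI = 10_000;

    private final long clockSeqAndNode;
    private long lastMillis = Long.MIN_VALUE;
    private int ticks = 0; // 100 ns slots already used within lastMillis

    HistoricalTimeUuids() {
        SecureRandom rnd = new SecureRandom();
        long clockSeq = rnd.nextInt(1 << 14); // random 14-bit clock sequence
        // Random 48-bit node; the multicast bit marks it as not being a real MAC address.
        long node = (rnd.nextLong() & 0xFFFFFFFFFFFFL) | 0x010000000000L;
        // The top two bits "10" select the RFC 4122 variant.
        this.clockSeqAndNode = 0x8000000000000000L | (clockSeq << 48) | node;
    }

    // Assumption: calls arrive with non-decreasing epochMillis.
    synchronized UUID forMillis(long epochMillis) {
        if (epochMillis < lastMillis) {
            throw new IllegalArgumentException("timestamps must not go backwards");
        }
        if (epochMillis == lastMillis) {
            if (++ticks >= TICKS_PER_MILLI) {
                throw new IllegalStateException("more than 10000 rows in one millisecond");
            }
        } else {
            lastMillis = epochMillis;
            ticks = 0;
        }
        long ts = epochMillis * TICKS_PER_MILLI + UUID_EPOCH_OFFSET + ticks;
        long msb = (ts << 32)                       // time_low
                 | ((ts & 0xFFFF00000000L) >>> 16)  // time_mid
                 | 0x1000L                          // version 1
                 | ((ts >>> 48) & 0x0FFFL);         // time_hi
        return new UUID(msb, clockSeqAndNode);
    }

    // Recovers the millisecond timestamp embedded in a version 1 UUID.
    static long toEpochMillis(UUID uuid) {
        return (uuid.timestamp() - UUID_EPOCH_OFFSET) / TICKS_PER_MILLI;
    }
}
```

With this sketch, calling forMillis twice with the same value returns two different UUIDs that both map back to the same millisecond, so neither row overwrites the other and the range query above still selects them. If your data is not sorted by time, or you need stronger guarantees across many writers, a well-tested library is the safer choice.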