cassandra schema time-series datastax datastax-enterprise

Time Series schema design in Cassandra

All,

We are doing a POC for an IoT-based application. The chosen database is Cassandra. We will be receiving time-series data from devices mounted on vehicles. The major attributes of the time-series data are given below:

TimeStamp: Date and time at which the data was received
DeviceId: Unique ID of the device mounted on the vehicle
Latitude: Current latitude of the vehicle
Longitude: Current longitude of the vehicle
Speed: Speed of the vehicle

We are planning to make the month and year the partition key, and the device ID and the timestamp the clustering keys. Is this the best way to fetch the data for the following types of queries?

Retrieve the data for a device with the DeviceId between a start date and end date
Retrieve the data for all devices between a start date and end date

Thanks in Advance

Solution

Data modeling in Cassandra works best with a query-driven approach. See this blog post for the basic "rules" of Cassandra data modeling.

Rule 1: Spread Data Evenly Around the Cluster

Rule 2: Minimize the Number of Partitions Read

You provided two queries in your question, and they differ only in scope. One asks for data in a time range for a single device ID; the other asks for data in a time range regardless of device ID.

Retrieve the data for a device with the DeviceId between a start date and end date

Retrieve the data for all devices between a start date and end date

The query your table(s) should support looks like the following:

What is the lat, long, speed for device(s) x during time period y?

The number of data points should be considered when choosing the partition key. What will be the normal time frame? Is it a minute, an hour, a day, a week, a month? That time frame should help determine how writes and partitions are handled. Partitioning on month and year works only as long as one month of readings stays under Cassandra's limit of 2 billion cells per partition. Note that the limit counts cells (rows times columns), and that with only month and year in the partition key a single partition holds the readings of every device for that month. See this SO answer for a good explanation on partitioning around the limits.
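To make the sizing question concrete, here is a rough back-of-the-envelope sketch in Python. The device count, the reading interval, and the helper name `cells_per_partition` are assumptions chosen for illustration, not figures from the question; only the 2 billion cell limit and the three value columns (latitude, longitude, speed) come from the discussion above.

```python
# Rough partition-size estimate. All workload numbers are assumed example values.
CELL_LIMIT = 2_000_000_000  # Cassandra's maximum number of cells per partition


def cells_per_partition(devices, readings_per_device_per_day, days_in_bucket,
                        value_columns=3):
    """Estimate the cells written to one partition during one time bucket.

    value_columns counts the non-key columns stored per reading
    (latitude, longitude and speed in the question).
    """
    rows = devices * readings_per_device_per_day * days_in_bucket
    return rows * value_columns


# Month bucket shared by all devices (assumed: 5,000 devices,
# one reading every 10 seconds, i.e. 8,640 readings per device per day).
monthly = cells_per_partition(devices=5_000,
                              readings_per_device_per_day=8_640,
                              days_in_bucket=31)

# Day bucket holding a single device (device id moved into the partition key).
daily = cells_per_partition(devices=1,
                            readings_per_device_per_day=8_640,
                            days_in_bucket=1)

print(monthly, monthly <= CELL_LIMIT)
print(daily, daily <= CELL_LIMIT)
```

Under these assumed numbers the shared month bucket goes past the limit while a per-device day bucket stays far below it. Plug in your own device count and reporting interval before settling on a bucket size.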

Understanding partitioning is key to enable range result sets. See the following excerpt from the "Deep look at the CQL WHERE clause".

You cannot use range operators such as < and > on a partition key. (ALLOW FILTERING can get you around this, but do not make that part of your core schema design.) Range operators must be used on clustering columns.

Cassandra distributes the partitions across the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data, Cassandra does not support the >, >=, <= and < operators directly on the partition key.

Instead, it allows you to use the >, >=, <= and < operators on the partition key through the use of the token function.

SELECT * FROM numberOfRequests
    WHERE token(cluster, date) > token('cluster1', '2015-06-03')
    AND token(cluster, date) <= token('cluster1', '2015-06-05')
    AND time = '12:00';
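
Because range operators only apply to clustering columns, a date-range query against a table partitioned by month and year has to name every month bucket it touches with an equality (or IN) restriction on the partition key. A minimal Python sketch of that client-side step follows; the function name `month_buckets` is hypothetical and the dates are made up for illustration.

```python
from datetime import date


def month_buckets(start, end):
    """Return every (year, month) partition bucket touched by [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append((year, month))
        # After December, roll over to January of the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


buckets = month_buckets(date(2015, 11, 20), date(2016, 2, 3))
print(buckets)  # [(2015, 11), (2015, 12), (2016, 1), (2016, 2)]
```

Each bucket then becomes one query (or one entry in an IN list) on the partition key, with the start and end timestamps applied as >= and <= on the timestamp clustering column. Keep in mind that clustering columns must be restricted in order, so with device ID ahead of the timestamp the range only works once the device ID is fixed.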