Tags: cassandra, schema, time-series, datastax, datastax-enterprise

Time Series schema design in Cassandra


All,

We are doing a POC for an IoT-based application. The chosen database is Cassandra. We will be receiving time-series data from devices mounted on vehicles. The major attributes of the time-series data are given below:

  • TimeStamp: Date and time of the received data
  • DeviceId: Unique id of the device mounted on the vehicle
  • Latitude: Current latitude of the vehicle
  • Longitude: Current longitude of the vehicle
  • Speed: Speed of the vehicle

We are planning to use the month and year as the partition key, and the device id and the timestamp as the clustering keys. Is this the best design for fetching the data with the following types of queries?

  • Retrieve the data for a device with the DeviceId between a start date and end date
  • Retrieve the data for all devices between a start date and end date

Thanks in advance.


Solution

  • Data modeling in Cassandra works best with a query-driven approach. See this blog post for the basic "rules" of Cassandra data modeling:

    Rule 1: Spread Data Evenly Around the Cluster

    Rule 2: Minimize the Number of Partitions Read

    You provided two queries in your question, and they differ only in scope. One asks for data in a time range for a single device id; the other asks for data in a time range regardless of device id.

    Retrieve the data for a device with the DeviceId between a start date and end date

    Retrieve the data for all devices between a start date and end date

    The query your table(s) should support looks like the following:

    What is the lat, long, speed for device(s) x during time period y

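    The number of data points should drive the partitioning. What is the normal time frame for a query? Is it a minute, an hour, a day, a week, a month? That time frame should help determine how writes and partitions are handled. If you partition on month and year alone, every device writes to the same partition for a whole month, so this only works while that partition stays within Cassandra's limits. The hard cap is 2 billion cells per partition (roughly rows times non-key columns, so with three value columns it is well under 2 billion readings), and in practice partitions are usually kept far smaller than that. See this SO answer for a good explanation on partitioning around the limits.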
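
    As a rough sketch of what a bucketed schema could look like, assuming a day is a reasonable bucket for your data volume. The table name, column names, column types and the example device id are illustrative assumptions, not something given in the question, and the `date` type assumes Cassandra 2.2 or later:

    ```cql
    -- Hypothetical table: one partition per device per day.
    -- The day-sized bucket is an assumption; pick the bucket from your
    -- real write rate and typical query window.
    CREATE TABLE IF NOT EXISTS vehicle_readings_by_device_day (
        device_id text,
        day       date,       -- time bucket that bounds partition size
        ts        timestamp,  -- clustering column, supports range queries
        latitude  double,
        longitude double,
        speed     double,
        PRIMARY KEY ((device_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- One device, a time range inside a single day: one partition is read.
    SELECT ts, latitude, longitude, speed
    FROM vehicle_readings_by_device_day
    WHERE device_id = 'device-42'
      AND day = '2015-06-03'
      AND ts >= '2015-06-03 08:00:00+0000'
      AND ts <  '2015-06-03 12:00:00+0000';
    ```

    With this layout, writes are spread across devices and days (Rule 1), and a range that spans several days becomes one query per day bucket, which keeps the number of partitions read small and predictable (Rule 2). It does not by itself serve the "all devices" query; that one would likely need its own table.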

    Understanding partitioning is key to making range queries work. See the following excerpt from "A deep look at the CQL WHERE clause".

    You will not be able to use <, > operators on a partition key. (ALLOW FILTERING can get you around this but do not make that part of your core schema design.) The operators must be used on clustering columns.

    Cassandra distributes the partitions across the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data, Cassandra does not support the >, >=, <= and < operators directly on the partition key.

    Instead, it allows you to use the >, >=, <= and < operators on the partition key through the use of the token function.

    SELECT * FROM numberOfRequests
        WHERE token(cluster, date) > token('cluster1', '2015-06-03')
        AND token(cluster, date) <= token('cluster1', '2015-06-05')
        AND time = '12:00';
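
    Applied to the schema proposed in the question, the same rules play out roughly as follows. This is only a sketch: the table name, column names and types are assumptions, and the comment about the rejected query reflects typical Cassandra behaviour rather than something tested against your version.

    ```cql
    -- The proposed design: month and year as the partition key,
    -- device id and timestamp as clustering columns (hypothetical names).
    CREATE TABLE IF NOT EXISTS readings_by_month (
        reading_year  int,
        reading_month int,
        device_id     text,
        ts            timestamp,
        latitude      double,
        longitude     double,
        speed         double,
        PRIMARY KEY ((reading_year, reading_month), device_id, ts)
    );

    -- Query 1 (one device, date range): equality on the partition key and
    -- on device_id, range on the last restricted clustering column.
    SELECT ts, latitude, longitude, speed
    FROM readings_by_month
    WHERE reading_year = 2015 AND reading_month = 6
      AND device_id = 'device-42'
      AND ts >= '2015-06-03 00:00:00+0000'
      AND ts <  '2015-06-05 00:00:00+0000';

    -- Query 2 (all devices, date range): ts is restricted while the
    -- preceding clustering column device_id is not, so this form is
    -- expected to be rejected unless ALLOW FILTERING is added.
    -- SELECT device_id, ts, latitude, longitude, speed
    -- FROM readings_by_month
    -- WHERE reading_year = 2015 AND reading_month = 6
    --   AND ts >= '2015-06-03 00:00:00+0000'
    --   AND ts <  '2015-06-05 00:00:00+0000';
    ```

    So the proposed key can serve the per-device query within a month, but the all-devices query would want a second table with ts ahead of device_id in the clustering order, and a range that crosses a month boundary needs one query per month partition.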