Tags: cassandra, cql

Cassandra and <=, >= operators on dates without ALLOW FILTERING


I'm a newbie with Cassandra and I don't understand why I can't filter by dates (I want to return results between two dates). For example:

CREATE TABLE test.service_bar(
    service_bar_id UUID,
    start_date_time timestamp,
    end_date_time timestamp,
    title varchar,
    message text,
    is_active boolean,
    PRIMARY KEY((start_date_time, end_date_time))
);

Then this works:

  SELECT start_date_time, end_date_time, is_active, message, service_bar_id, title
  FROM test.service_bar
  WHERE start_date_time = '2019-10-30 14:10:29'  AND end_date_time = '2019-10-30 14:10:29'
  LIMIT 500;

but this doesn't:

  SELECT start_date_time, end_date_time, is_active, message, service_bar_id, title
  FROM test.service_bar
  WHERE start_date_time >= '2019-10-30 14:10:29'  AND end_date_time <= '2019-10-30 14:10:29'
  LIMIT 500;

I can't use ALLOW FILTERING.

How can I do such queries in Cassandra?


Solution

  • I don't understand why I can't filter by dates (I want to return results between two dates)

    The behavior you are seeing is because of this:

    PRIMARY KEY((start_date_time, end_date_time))
    

    You have defined start_date_time and end_date_time as a composite partition key. As Cassandra uses a distributed hash to ensure proper data distribution, partitions are not stored in order of their values. They are stored by the hashed token value of the partition key. You can see this by using the token function on the partition key:

    aaron@cqlsh:stackoverflow> SELECT token(start_date_time,end_date_time),start_date_time,end_date_time,service_bar_id FROM service_bar ;
    
     system.token(
         start_date_time,
         end_date_time)   | start_date_time                 | end_date_time                   | service_bar_id
    ----------------------+---------------------------------+---------------------------------+--------------------------------------
        26346508703811310 | 2019-10-30 19:10:29.000000+0000 | 2019-10-30 19:10:29.000000+0000 | 49a70440-8689-4248-b389-13b8d0373e58
      1488616260313758762 | 2019-11-01 19:10:29.000000+0000 | 2019-11-01 19:10:29.000000+0000 | b0bab610-a285-41e7-ba5c-d56f8fb12f52
      2185622653117187064 | 2019-10-30 21:10:29.000000+0000 | 2019-10-30 21:10:29.000000+0000 | 3686c6a6-fd8d-4247-b501-964363a48f63
      7727638696734890177 | 2019-10-30 20:10:29.000000+0000 | 2019-10-30 20:10:29.000000+0000 | 97fc799e-fb54-4b7f-956e-f06bcb9e9d9d
    
    (4 rows)
    

    That's the default order of your rows. This is done because each node is responsible for specific token ranges, ensuring that data is distributed as evenly as possible in a multi-node cluster (the usual production use case). Because of this, CQL has some limitations on how you can query partition keys. These limitations are in place to save you from writing bad queries; one of them is that range queries on partition key columns are not allowed.
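
    To illustrate, a range predicate directly on a partition key column is rejected outright, and the only sanctioned way to range over partitions is the token() function, which ranges over hashed token values rather than dates, so it doesn't help here. A quick sketch (the exact error text varies by Cassandra version):

    -- Rejected; cqlsh responds with something like:
    -- InvalidRequest: ... Only EQ and IN relation are supported on the partition key
    --                     (unless you use the token() function)
    SELECT * FROM test.service_bar
    WHERE start_date_time >= '2019-10-30 14:10:29';

    -- Accepted, but it ranges over hashed token values, not chronological order:
    SELECT * FROM test.service_bar
    WHERE token(start_date_time, end_date_time) > token('2019-10-30 14:10:29', '2019-10-30 14:10:29');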

    How can I do such queries in Cassandra?

    This should also tell you that you should build your tables and queries to ensure that they can be fulfilled by a request to a single node. Given that, your use case will really only work if you change your partition key.

    One way dev teams implement solutions like yours is to use a modeling technique called "time bucketing," or sometimes just "bucketing." In this case, let's say that you'll never write more than a few thousand entries per month. Maybe that's not the case, but I'll use it for this example. I can then partition on month and use the _time columns as clustering keys.

    CREATE TABLE stackoverflow.service_bar_by_month (
        month_bucket int,
        start_date_time timestamp,
        end_date_time timestamp,
        is_active boolean,
        message text,
        service_bar_id uuid,
        title text,
        PRIMARY KEY (month_bucket, start_date_time, end_date_time)
    ) WITH CLUSTERING ORDER BY (start_date_time DESC, end_date_time DESC);
    

    This will store all rows together by the value of month_bucket, and within each partition, the rows will be sorted by start_date_time and end_date_time in descending order. Now this works:

    aaron@cqlsh:stackoverflow> SELECT start_date_time, end_date_time, is_active, message, service_bar_id, title
                     ... FROM service_bar_by_month
                     ... WHERE month_bucket = 201910 AND start_date_time >= '2019-10-30 14:10:29'  AND start_date_time <= '2019-10-31 23:59:59';
    
     start_date_time                 | end_date_time                   | is_active | message           | service_bar_id                       | title
    ---------------------------------+---------------------------------+-----------+-------------------+--------------------------------------+--------
     2019-10-30 21:10:29.000000+0000 | 2019-10-30 21:10:29.000000+0000 |      True | This is an alert3 | eae5d3be-b2b2-40a1-aa28-0412fe9c18e6 | alert3
     2019-10-30 20:10:29.000000+0000 | 2019-10-30 20:10:29.000000+0000 |      True | This is an alert2 | af4ec72f-7758-42ef-b731-8d08f8a00006 | alert2
     2019-10-30 19:10:29.000000+0000 | 2019-10-30 19:10:29.000000+0000 |      True | This is an alert1 | 8b13db5c-9e39-4ee5-90a9-64758c5ab5be | alert1
    
    (3 rows)
    
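    On the write side, your application has to compute month_bucket from start_date_time itself; Cassandra won't derive it for you. A minimal sketch, assuming a YYYYMM integer bucket and using CQL's built-in uuid() function to generate the id:

    -- 201910 is derived in application code from start_date_time (October 2019).
    INSERT INTO stackoverflow.service_bar_by_month
        (month_bucket, start_date_time, end_date_time, is_active, message, service_bar_id, title)
    VALUES
        (201910, '2019-10-30 19:10:29', '2019-10-30 19:10:29', true, 'This is an alert1', uuid(), 'alert1');
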

    Do note, though, that a query can apply a range to only one clustering key, like start_date_time above. This can't work:

    AND start_date_time >= '2019-10-30 14:10:29'  AND end_date_time <= '2019-10-31 23:59:59';
    

    And it can't work because Cassandra is designed to read and write data to/from disk sequentially. Allowing range queries on multiple columns in a single query would require Cassandra to do random reads, which it just isn't good at. You can force it with the ALLOW FILTERING directive, but that's not recommended. That said, ALLOW FILTERING within a small partition would probably perform OK, as shown below.
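
    If you do go that route, restrict the partition key first so the filtering happens inside a single partition rather than across the whole cluster. A hedged sketch against the bucketed table above:

    -- end_date_time is the second clustering key, so its range predicate
    -- requires ALLOW FILTERING; month_bucket = 201910 keeps the scan
    -- confined to a single partition.
    SELECT start_date_time, end_date_time, is_active, message, service_bar_id, title
    FROM service_bar_by_month
    WHERE month_bucket = 201910
      AND start_date_time >= '2019-10-30 14:10:29'
      AND end_date_time <= '2019-10-31 23:59:59'
    ALLOW FILTERING;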