cassandra, cqlsh, scylla

What does "PER PARTITION LIMIT" mean in a CQL query in Cassandra?


I have a scylla table as shown below:

cqlsh:sampleks> describe table test;

CREATE TABLE test (
    client_id int,
    when timestamp,
    process_ids list<int>,
    md text,
    PRIMARY KEY (client_id, when) ) WITH CLUSTERING ORDER BY (when DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 172800
    AND max_index_interval = 1024
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

And this is how we are querying it. It's been a long time since I worked with Cassandra, so PER PARTITION LIMIT is new to me (it looks like it was added recently). Can someone explain what it does, with an example, in layman's terms? I couldn't find any good documentation that explains it simply.

SELECT * FROM test WHERE client_id IN ? PER PARTITION LIMIT 1;

Solution

  • The PER PARTITION LIMIT clause can be helpful in a "wide partition scenario." It caps the number of rows returned from each partition: PER PARTITION LIMIT n returns at most the first n rows of every partition the query touches.

    Take this query:

    aploetz@cqlsh:stackoverflow> SELECT client_id,when,md 
            FROM test PER PARTITION LIMIT 2 ;
    

    Considering the PRIMARY KEY definition of (client_id,when), that query will iterate over each client_id. Cassandra will then return only the first two rows from each partition (ordered by when, newest first, because of CLUSTERING ORDER BY (when DESC)), regardless of how many occurrences of when may be present.

    In this case, I inserted 7 rows into your test table, using two different client_ids (2 partitions total). Using a PER PARTITION LIMIT of 2, I get 4 rows returned: 2 partitions x 2 rows per partition.

     client_id | when                            | md
    -----------+---------------------------------+-----
             1 | 2020-05-06 12:00:00.000000+0000 | md1
             1 | 2020-05-05 22:00:00.000000+0000 | md1
             2 | 2020-05-06 19:00:00.000000+0000 | md2
             2 | 2020-05-06 01:00:00.000000+0000 | md2
    
    (4 rows)
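
To make the behaviour concrete outside of cqlsh, here is a minimal Python sketch that models what the clause does: group rows by partition key, order each group by the clustering column, and keep at most n rows per group. It is only an illustration of the semantics, not how Cassandra or Scylla implement it. The function name `per_partition_limit` is made up for this example, and since only four of the seven inserted rows appear in the output above, the other three rows in the sample data are hypothetical stand-ins.

```python
from collections import defaultdict
from datetime import datetime


def per_partition_limit(rows, n):
    """Model of PER PARTITION LIMIT n for PRIMARY KEY (client_id, when)."""
    # Group rows by partition key (client_id).
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["client_id"]].append(row)

    result = []
    # Real clusters return partitions in token order; sorting by client_id
    # here is only to keep the sketch deterministic.
    for client_id in sorted(partitions):
        # CLUSTERING ORDER BY (when DESC): newest row first.
        ordered = sorted(partitions[client_id], key=lambda r: r["when"], reverse=True)
        # Keep at most n rows from this partition.
        result.extend(ordered[:n])
    return result


rows = [
    {"client_id": 1, "when": datetime(2020, 5, 6, 12, 0), "md": "md1"},
    {"client_id": 1, "when": datetime(2020, 5, 5, 22, 0), "md": "md1"},
    {"client_id": 1, "when": datetime(2020, 5, 4, 9, 0), "md": "md1"},   # hypothetical row
    {"client_id": 1, "when": datetime(2020, 5, 3, 9, 0), "md": "md1"},   # hypothetical row
    {"client_id": 2, "when": datetime(2020, 5, 6, 19, 0), "md": "md2"},
    {"client_id": 2, "when": datetime(2020, 5, 6, 1, 0), "md": "md2"},
    {"client_id": 2, "when": datetime(2020, 5, 5, 8, 0), "md": "md2"},   # hypothetical row
]

top_two = per_partition_limit(rows, 2)   # 2 partitions x 2 rows each
latest = per_partition_limit(rows, 1)    # newest row of each partition
```

Applied to the query in the question, `WHERE client_id IN ? PER PARTITION LIMIT 1` together with the DESC clustering order means you get back only the most recent row (by when) for each client_id in the IN list.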