Search code examples
cassandracqlcql3datastax-java-driver

Cassandra - IN or TOKEN query for querying an entire partition?


I want to query a complete partition of my table. My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.

I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.

Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.

Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);

So my question is: Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?

I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.


Solution

  • You say:

    Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.

    but actually you'd query 24 partitions.

    What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you add the hour to avoid an hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have two choices:

    1. Run 1 query at time.
    2. Run 2 queries the first time, and then one at time to "prefetch" results.
    3. Run 24 queries in parallel.

    CASE 1

    If you process data sequentially, the first choice is to run the query for the hour 0, process the data and, when finished, run the query for the hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.

    CASE 2

    If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data of both the hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish to process data for hour 0 you could prefetch data for hour 2 and start processing data for hour 1. And so on.... In this way you could speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.

    Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:

    ResultSet rs = session.execute("your query");
    for (Row row : rs) {
        if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
            rs.fetchMoreResults(); // this is asynchronous
        // Process the row ...
    }
    

    where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.

    You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.

    CASE 3

    If you need to process data from different partitions together, eg you need both data for hour 3 and 6, then you could try to group data by "dependency" (eg query both hour 3 and 6 in parallel).

    If you need all of them then should run 24 queries in parallel and then join them at application level (you already know why you should avoid the IN for multiple partitions). Remember that your data is already ordered, so your application level efforts would be very small.