Search code examples
cassandracqlcql3

How to properly select partition keys that match an index


I try to select the partition keys that contain at least one row with a specific value in an indexed column.

With the current solution all the other requirements are met:

  • Ability to select reports based on their office.
  • Given an office, ability to select using type and date range.
  • No needs to select reports based on date without emission office and/or reports type.

Last, i need the ability to select all the offices where a certain user has created a report. Based on cassadra documentation i have added an index on the user column.

The table is defined as:

create table report(
    office uuid,
    type text,
    insert_date timestamp,
    ...
    created_by uuid,
    ...
    primary key(office, type, insert_date));
create index created_by_idx on report (created_by);

Using that index, if i'm not wrong, is like having a secondary table described as follow:

create table report2(
    created_by uuid,
    office uuid,
    type text,
    insert_date timestamp,
    ...
    primary key(created_by ,office, type, insert_date));

I can successfully run a query like: select office from report where created_by = ? but that results in multiple row with the same office key, that it is correct: each user can create multiple report in each office.

Now i filter duplicated offices at software level, but i'm asking myself if it was possible to filter that data directly during extraction.

I tried: select distinct office from report where created_by = ? that results in SELECT DISTINCT with WHERE clause only supports restriction by partition key and/or static columns.

Then i tried: select office from report where created_by = ? group by office that give me the correct results, but raise a warning: Aggregation query used without partition key

Could this be a problem somehow? How handle cassandra a query like this and can this warning, in this case, be ignorated? And finally, is really a better choise use a query like this against a select * ... whit the same where clause?


Solution

  • There is a reason why Cassandra doesn't support the feature you describe - it's because it can be inefficient:

    First, the secondary-index must indeed list, as you noted, all the matching row keys and not just distinct matching partition keys. This is because you may ask for all of the rows and not just the distinct partition keys, so the data must be there. Another reason why this data must be kept in the index is that each individual row may be deleted (or expired) individually, so Cassandra needs to keep track of all of them to decide whether the overall partition still exists, or no longer does.

    Now, since we already have all rows' keys listed in the list of search results, outputting just the "distinct" partition keys is an inefficient operation... If you have a million matching rows in one partition, Cassandra will need to scan all of them before outputting just one result. Cassandra usually doesn't let the user shoot himself in the foot with an inefficient operation (e.g., see how "ALLOW FILTERING" is needed to explicitly allow inefficient filtering in SELECT queries).

    That being said, it makes sense that a future Cassandra version should allow the "SELECT DISTINCT" request you requested, perhaps requiring the user to say "ALLOW FILTERING" to acknowledge it can be slow. Also, an alternative implementation might involve slower updates (each row update will checking the liveness of the partition) but then quick reads. Scylla considered such an implementation but hasn't implemented it either (see https://github.com/scylladb/scylla/issues/3435).