Search code examples
cassandracql

search within a cassandra column


I'm working with the movielens dataset and I have a column called 'genres' which has entries such as 'Action|War', 'Action|Adventure|Comedy|Sci-Fi'. I wish to count the number of rows that have the text 'Comedy' in them.

SELECT COUNT(*) FROM movielens.data_movies WHERE genres = 'Comedy' ALLOW FILTERING

But this counts only the exact instances of 'Comedy'. It does not count 'Action|Adventure|Comedy|Sci-Fi' which I want it to do. So I tried,

SELECT COUNT(*) FROM movielens.data_movies WHERE genres CONTAINS 'Comedy' ALLOW FILTERING 

However, that gives me the error

Cannot use CONTAINS on non-collection column genres

From this it seems that there is no easy way to do what I'm asking. Does anyone know of a simpler solution?


Solution

  • So what you can do, is to create a CUSTOM index on genres.

    CREATE CUSTOM INDEX ON movielens.data_movies(genres)
      USING 'org.apache.cassandra.index.sasi.SASIIndex'
      WITH OPTIONS={'mode':'CONTAINS'};
    

    Then this query should work:

    SELECT COUNT(*) FROM movies
    WHERE genres LIKE '%Comedy%';
    

    However, if you're running a query across millions of rows over multiple nodes, this query will likely timeout. This is because Cassandra has to poll multiple partitions and nodes to build the result set. Queries like this don't really work well in Cassandra.

    The best way to solve for this, is to create a table partitioned by genre, like this:

    CREATE TABLE movies_by_genre (
        id int,
        title TEXT,
        genre TEXT,
        PRIMARY KEY(genre,title,id));
    

    This is of course also assuming that genres is split-out by each individual genre. But then this query would work:

    SELECT COUNT(*) FROM movies_by_genre
    WHERE genre = 'Comedy';