I'm working with the movielens dataset and I have a column called 'genres' which has entries such as 'Action|War', 'Action|Adventure|Comedy|Sci-Fi'. I wish to count the number of rows that have the text 'Comedy' in them.
SELECT COUNT(*) FROM movielens.data_movies WHERE genres = 'Comedy' ALLOW FILTERING
But this counts only the exact instances of 'Comedy'. It does not count 'Action|Adventure|Comedy|Sci-Fi' which I want it to do. So I tried,
SELECT COUNT(*) FROM movielens.data_movies WHERE genres CONTAINS 'Comedy' ALLOW FILTERING
However, that gives me the error
Cannot use CONTAINS on non-collection column genres
From this it seems that there is no easy way to do what I'm asking. Does anyone know of a simpler solution?
So what you can do, is to create a CUSTOM
index on genres
.
CREATE CUSTOM INDEX ON movielens.data_movies(genres)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
Then this query should work:
SELECT COUNT(*) FROM movies
WHERE genres LIKE '%Comedy%';
However, if you're running a query across millions of rows over multiple nodes, this query will likely timeout. This is because Cassandra has to poll multiple partitions and nodes to build the result set. Queries like this don't really work well in Cassandra.
The best way to solve for this, is to create a table partitioned by genre
, like this:
CREATE TABLE movies_by_genre (
id int,
title TEXT,
genre TEXT,
PRIMARY KEY(genre,title,id));
This is of course also assuming that genres
is split-out by each individual genre. But then this query would work:
SELECT COUNT(*) FROM movies_by_genre
WHERE genre = 'Comedy';