database-design cassandra datastax denormalization nosql

use select with different attributes present in where clause Cassandra

I need to create a Cassandra column family with following attributes.

id bigint,
content varchar,
year int,
frequency int,

I want to get the content with highest frequency in a given year using this column family. Also when inserting data to table, for given content and year, I need to check if an id already exist or not. How can I achieve this with Cassandra?

I tried creating CF using

CREATE TABLE sinmin.word_time_inv_frequency (
id bigint,
content varchar,
year int,
frequency int,
PRIMARY KEY((year), frequency)
);

and then retrieved data using

SELECT id FROM word_time_inv_frequency WHERE year = 2010 ORDER BY frequency ;

But when using this, I can't check if entry is already existing for the (content,year) pair in the CF.

Solution

You can use a compound partition key to be able to select by (content, year) and still be able to order by frequency:

with this table you'd be able to

create table test2 (
 id bigint,
 content varchar,
 year int,
 frequency int, 
PRIMARY KEY((year, content), frequency)
);

Your query would work as follows:

select * from test2 where content ='puppies' and year=2014 order by frequency ;

It is best practice to maintain different tables to address querying needs. You can look into some of the integrations in datastax enterpise (search/analytics) for ad-hoc queries.

Please check the following videos on data modeling for an in-depth look at c* data modeling: https://www.youtube.com/playlist?list=PL75iJfNDd0_FI-Ia_b4z8aoDdOCQTi35I

By the way, is id your unique identifier for this dataset? In the table def you provided you'll overwrite your records if they have the same year and frequency. Make sure you get a unique identifier in your primary key.

Also consider using Clustering Order By in your table definition if you are always going to be pulling data in the same order.