Search code examples
cassandramodelingnosql

How to model for repeated information on many records on cassandra


I have a massively huge table with hundreds of billions of records and I mean to add a field in this table of which the same value would be repeated for millions of records. I don't know how to efficiently model this in cassandra. Allow me to elaborate:

I have a generic table:

CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    PRIMARY KEY ((key, key2) time)
)

This table has 700.000.000+ records. I want to create a field in this table, named source. This field indicates where the record was gotten from (since the software has many ways of receiving the information on the reading table). One possible value for this field is "XML: path\to\file.xml" or "Direct import from the X database" or even "Manually added", I want this to be a descriptive field, used exclusively to allow later maintenance in the database where we want to manipulate only records from a given source.

The queries I want to run that I can't now are:

  • Which records on the readings table were gotten from a given source?
  • What is the source of a given record?

A solution would be for me to create a table such as:

CREATE TABLE readings_per_source(
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
)

which would allow me to execute the first query, but would also mean that I would create 700.000.000+ new records on my database with a lot of information, which would take a lot of unnecessary storage space since tens of millions of these records would have the same value for source.

If this was a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields, that would mean storing only an additional integer for each row on the readings table and a new table with as many records as different sources there was.

How does one go about modelling this in cassandra?


Solution

  • Your schema

    CREATE TABLE readings_per_source(
        source text,
        key int,
        key2 int,
        time timestamp,
        PRIMARY KEY (source, key, key2, time)
    )
    

    is a very bad idea because source is the partition key and you can have millions of records sharing the same source e.g. having a very very wide partition --> hot spots

    For you second query, What is the source of a given record? is it quite trivial if you access the data using the record primary keys (key, key2). The source column can be added as just a regular column into the table

    For the first query Which records on the readings table were gotten from a given source? it is trickier. The idea here is to fetch all the records having the same source.

    Do you realize that this query can potentially return tens of millions of records ?

    If it's what you want to do, there is a solution, use the new SASI secondary index (read my blog post for all details) and create an index on the source column

    CREATE TABLE readings (
        key int,
        key2 int,
        time timestamp,
        name text,
        source text,
        PRIMARY KEY ((key, key2), time)
    )
    
    CREATE CUSTOM INDEX source_idx ON readings(source) 
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
         'mode': 'PREFIX', 
         'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
         'case_sensitive': 'false'
    };
    

    Then to fetch all records having the same source, use server-side paging feature of the Java driver (or any other Datastax driver)