cassandra, scylla

Storing arrays in Cassandra


I have lots of fast incoming data that is organised as follows:

  • Lots of 1D arrays, one per logical object. The position of each element in the array is important, and each element is calculated and produced individually in parallel, so elements do not necessarily arrive in order.
  • The data arrays themselves are not necessarily written in order.
  • The length of the arrays may vary.
  • The data is read as an entire array at a time, so it makes sense to store the entire thing together.

The way I see it, the issue is primarily caused by the way the data is made available for writing. If it were all available together, I'd just store the entire lot at the same time and be done with it.

For smaller data loads I can get away with the Postgres array datatype: one row per logical object, with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. However, this is limited by the rate of a single Postgres node.

In Cassandra/Scylla it looks like my options are:

  1. Storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable, and could involve lots of very wide scans.
  2. Converting the array to a JSON string, then reading the cell, tweaking the value and rewriting it, which would be horribly slow and lead to lots of compaction overhead.
  3. Having the writer buffer until it receives all the array values and then writing the array in one go. However, the writer won't know how long the array should be, so it will need a timeout after which it writes whatever it has; that ultimately means I'll need to update the row at some point in the future if late data turns up.

What other options do I have?

Thanks


Solution

  • Option 1 seems to be a good match. I assume each logical object has a unique id (or, better, a UUID). In that case, you can create something like:

    CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
    

    Here id is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
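
    As a rough sketch of how the parallel writers could use this table: each element is written with its own INSERT, in whatever order it is produced. The UUID and the values below are made-up placeholders, not data from the question.

    ```cql
    -- Elements of one "array" arriving out of order (placeholder id and values).
    INSERT INTO tbl (id, ord, v) VALUES (123e4567-e89b-12d3-a456-426614174000, 2, 'c');
    INSERT INTO tbl (id, ord, v) VALUES (123e4567-e89b-12d3-a456-426614174000, 0, 'a');
    INSERT INTO tbl (id, ord, v) VALUES (123e4567-e89b-12d3-a456-426614174000, 1, 'b');

    -- An INSERT in CQL is an upsert: writing the same (id, ord) again
    -- simply overwrites the earlier value (last write wins).
    INSERT INTO tbl (id, ord, v) VALUES (123e4567-e89b-12d3-a456-426614174000, 1, 'b2');
    ```

    Because each element is a separate row, writers do not need to buffer or know the final array length, and late data can be inserted whenever it turns up.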

    This allows

    • fast retrieval of the entire "array", even a big one, using paging
    • fast retrieval of a single element by its index in the array
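
    The reads listed above might look like the following queries. This is only a sketch against the tbl schema from this answer; the UUID is a placeholder, and paging is handled by the client driver rather than in the query text.

    ```cql
    -- Whole "array": a single-partition read; rows come back sorted by ord.
    SELECT ord, v FROM tbl WHERE id = 123e4567-e89b-12d3-a456-426614174000;

    -- One element by its index.
    SELECT v FROM tbl
     WHERE id = 123e4567-e89b-12d3-a456-426614174000 AND ord = 1;

    -- A slice of the array, using a range on the clustering key.
    SELECT ord, v FROM tbl
     WHERE id = 123e4567-e89b-12d3-a456-426614174000 AND ord >= 10 AND ord < 20;
    ```

    Since ord is the clustering key, rows within a partition are stored in index order, so the reader does not have to sort elements that were written out of order. Indexes that have not been written yet are simply absent from the result.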