Search code examples
cassandracassandra-2.0

Data Versioning in Cassandra with CQL3


I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.

I have this basic table:

CREATE TABLE documents (
    itemid_version text,       
    xml_payload text,
    insert_time timestamp,
    PRIMARY KEY (itemid_version)
); 

itemid is actually a UUID (and unique for all documents), and version is an int (version 0 is the "first" version). xml_payload is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.

As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:

  • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)
  • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version
  • Use-Case 3: user needs the version history of a single (1) doc
  • Use-Case 4: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic.

First one's straightforward:

$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

Now to satisfy the 2nd and 3rd use cases, I am adding the following table:

CREATE TABLE document_versions (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);

The new records will be added as new docs and new versions of existing docs are created.

Now we have this (use case #2):

$latest_itemid, $latest_version = csql("select itemid, 
    version from document_versions where item_id = {0} 
    order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

and this (use case #3):

$versions = csql("select version from document_versions where item_id = {0}" 
    -f $itemid)

For the 3rd requirement, I am adding yet another table:

CREATE TABLE latest_documents (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
)

The records are inserted for new docs, records are updated for existing docs.

And now we have this:

$latest_itemids, $latest_versions = csql("select itemid, version 
    from latest_documents where item_id in ({0})" -f $itemid_list.toCSV())

foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
    $itemid_version = concat($latest_itemid, $latest_version)
    $latest_docs.append(
        cql("select * from documents where itemid_version = {0};" 
        -f $itemid_version))        

Now I hope it's clear why I concatenated itemid and version to create an index for documents as opposed to creating a compound key: I cannot have OR in the WHERE clause in SELECT.

You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.

Am I on the right track here? There are quite a number of things that doesn't sit well with me...but mainly because I don't understand Cassandra yet:

  • I feel that the primary key for documents should be a composite of (itemid, version) but I can't satisfy use case #4 (return a list from a query)...I can't possibly use a separate SELECT statement for each document due to the performance hit (network overhead)...or can (should) I?
  • It requires 2 trips to get a document if the version is not known beforehand. This is probably a compromise I have to live with, unless maybe there's a better way.

Solution

  • This is actually very similar to your solution except that you can store all the versions and be able to fetch the 'latest' version just from one table (document_versions).

    In most cases, I think you can get what you want in a single SELECT except use case #2 which requires fetching the most recent version of a document where a SELECT is first needed on document_versions.

    CREATE TABLE documents (
            itemid_version text,
            xml_payload text,
            insert_time timestamp,
            PRIMARY KEY (itemid_version)
    );
    
    CREATE TABLE document_versions (
            itemid text,
            version int,
            PRIMARY KEY (itemid, version)
    ) WITH CLUSTERING ORDER BY (version DESC);
    
    
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
    
    INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
    INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
    INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
    INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);
    
    • Use-Case 1: user needs to get the single (1) doc, user knows the item id and version (not necessarily the latest)

      SELECT * FROM documents WHERE itemid_version = 'doc1-2';
      
    • Use-Case 2: user needs to get the single (1) doc, user knows the item id but does not know the latest version (you need to feed the concatenated itemid + version from the result of the first query into the second query)

      SELECT * FROM document_versions WHERE itemid = 'doc2' LIMIT 1;
      
      SELECT * FROM documents WHERE itemid_version = 'doc2-2'; 
      
    • Use-Case 3: user needs the version history of a single (1) doc

      SELECT * FROM document_versions WHERE itemid = 'doc2';
      
    • Use-Case 3: user needs to get the list (1 or more) of docs, user knows the item id AND version (not necessarily the latest)

      SELECT * FROM documents WHERE itemid_version IN ('doc1-2', 'doc2-1');