What's the best Cassandra data model and query for the following situation?
Our system is responsible for uniquely assigning serial numbers to our toasters as each one is created at our factory.
Performance Requirements:
Our dataset is about 10M serial numbers today, growing by about 1M annually.
We are currently using Cassandra 2.0.x and will soon migrate to 2.1.x.
Ok, I put a bit of thought behind this one. There are a couple of things that are tricky about this scenario:
Serial numbers come in, and are assigned (but not right away). With growth of 1 million per year, that works out to ~2800 new serial numbers per day. Queuing those up (upserting them when they come in and deleting them when they are assigned) will create a lot of tombstones (essentially 2800 per day).
With 50 UPCs relating to 10 million serial numbers, that works out to 200k serial numbers per UPC (assuming even distribution). This means that we can't store the UPC-to-serial numbers relationship in a collection (max size of 65536 items).
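The back-of-the-envelope math behind those two points can be checked quickly (a sketch; the 1M/year, 10M, and 50-UPC figures come from the discussion above):

```python
# Rough arithmetic for the figures cited above.
new_per_year = 1_000_000
new_per_day = new_per_year / 365            # ~2740/day, i.e. roughly the 2800/day cited above

total_serials = 10_000_000
num_upcs = 50
serials_per_upc = total_serials // num_upcs  # 200,000 per UPC, assuming even distribution

# Far beyond what a Cassandra collection can hold (65536-item limit):
print(round(new_per_day), serials_per_upc)
```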
I assume that you'll want to be able to figure out which serial numbers are tied to which models, and which models have which serial numbers. For that, I'd go with two lookup tables:
CREATE TABLE serialNumbersByUPC (
modelUPC uuid,
insertTime timeuuid,
serialNumber text,
PRIMARY KEY (modelUPC,insertTime))
WITH CLUSTERING ORDER BY (insertTime DESC);
CREATE TABLE UPCsBySerialNumbers (
modelUPC uuid,
insertTime timeuuid,
serialNumber text,
PRIMARY KEY (serialNumber));
Note that you could also key serialNumbersByUPC with serialNumber as a clustering key (instead of insertTime). But timeuuids are unique (so there won't be collisions on serialNumber), and clustering by insertTime has the added benefit of allowing you to sort by date/time. Of course, you'll want to make sure that you upsert to both of those tables when assigning a serial number to a UPC.
For the unassigned serial numbers, it might be best to use a queuing system like HornetQ or RabbitMQ. That way you could just pull the new serial numbers off of the queue, and assign them as-needed. The reason I suggest this, is that using Cassandra to queue-up transient data has been identified as an anti-pattern.
Of course you could decide not to heed the above warning, and insist on using Cassandra for that functionality. If so, then this is how I would store unassigned serial numbers in Cassandra:
CREATE TABLE unassignedSerialNumbers (
dateBucket bigint,
serialNumber text,
insertTime timeuuid,
PRIMARY KEY ((dateBucket),insertTime))
WITH compaction = {'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy'}
AND gc_grace_seconds = 86400;
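Usage of that table could then look like the following (a sketch; the bucket value, serial number, and timeuuid are placeholders):

```cql
-- Queue a serial number as it comes in:
INSERT INTO unassignedSerialNumbers (dateBucket, serialNumber, insertTime)
  VALUES (20150416, '1234-5678', now());

-- Pull the unassigned serial numbers that came in on a given day:
SELECT serialNumber, insertTime FROM unassignedSerialNumbers
  WHERE dateBucket = 20150416;

-- Once a serial number is assigned, remove it from the queue
-- (this is the tombstone-generating step discussed below):
DELETE FROM unassignedSerialNumbers
  WHERE dateBucket = 20150416
  AND insertTime = 13814000-1dd2-11b2-8080-808080808080;
```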
A few things about this solution:
I'm partitioning on dateBucket as I'm not sure how fast you assign the ~2800 serial numbers that come in each day. It's possible that you may want to query just the numbers that came in today, or yesterday. I've created it as a bigint, but you can use whatever size bucket you wish (ex: 20150416 would partition serial numbers that came in on April 16th, 2015 together).
If you find that you assign serial numbers fast enough that you don't need to partition by dateBucket, then I wouldn't worry about that table getting big enough to hinder query performance. Sure, your deletes will create tombstones that your queries will have to contend with, but that should be helped by my last two points.
I'm clustering on insertTime for the same reason as I did in the serialNumbersByUPC table.
I'm using the DateTieredCompactionStrategy for this table. This strategy will keep rows written at the same time in the same SSTable files on-disk. This becomes important for performance as you delete and write new data.
gc_grace_seconds is being set to 1 day instead of the default 10 days. This will force tombstoned rows to be garbage-collected every day. The drawback of this setting is that if you have a node go down, you need to bring it back within 1 day of it going down to pick up the deletes. If you don't, you'll need to run a full repair, or risk your deleted serial numbers "coming back from the dead."
You'll also want to read up on the DateTieredCompactionStrategy. There may be some other options with it that might make sense for you to set.
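Computing the dateBucket value on the application side is straightforward (a sketch in Python; the YYYYMMDD format is an assumption matching the 20150416 example above):

```python
from datetime import datetime, timezone

def date_bucket(dt: datetime) -> int:
    """Day-granularity bucket matching the 20150416 example, as a bigint-friendly int."""
    return int(dt.strftime("%Y%m%d"))

# Serial numbers arriving on April 16th, 2015 all land in partition 20150416:
bucket = date_bucket(datetime(2015, 4, 16, tzinfo=timezone.utc))
```

A coarser bucket (e.g. YYYYMM) would mean fewer, larger partitions; a day-sized bucket keeps each partition to roughly the ~2800 rows that arrive per day.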
Let me know if you have any questions, or if I missed anything.