Search code examples
cassandraanalyticsdata-modelingcql3

Is a supercolumn in Cassandra the right model for a 1:n relationship in a traditional RDBMS


First of all, please forgive me, if this question is no fit for stack overflow- I honestly wouldn't know who or where else to ask.

I have the following situation: in the transactional system we have a table with business processes that get executed by users. These process might be something like "decommission car with ID xzy". The process itself involves arbitrary tasks, that might or might not be needed to execute to fulfill the process.

Now, in the Cassandra DB we want to setup a monitoring of the processes. That is, number of processes per day, number of open processes, number of canceled processes and most important: in what step was the process stopped, canceled or put on hold. Also the amount of time it took for a step until it was finished is of interest.

Please be aware, that all these processes and steps are manually done and reported back in the system.

Now I'm wondering how to best model this.

I'm thinking about a supercolumn with the process id (stored as uuid) and information about the job/process itself plus rows for the steps each with the step id (UUID as well) as column key.

Would that be the right model?

I'm also unsure about the time values. I have a start and end time for the overall process plus start and end times for each step. Given the fact, that most queries will be time based, as in "how many decommission jobs were started today?", the time values are of vital importance . Should I use a secondary index or is a table with the time as column key a better fit?

Maybe someone can guide me in the right direction as I am still quite new to Cassandra

Kind regards,

Chris


Solution

  • You don't need super columns for that (which are depracted btw). You should use clustering keys instead. You can get an idea on how they work from this response. The following primary would allow you to track a status for each process, task and event time: primary key (process_id, task_id, time). Afterwards you can query it by either only process_id or both process_id and task_id.