I need to have your opinion on my design of data model for my project. My project is real-time recommendation system. There are some collections of recommendation algorithms. it means something like this:
collection1 { algorithm1, algorithm5, algorithm6 }
collection2 { algorithm5, algorithm6, algorithm7, algorithm8 }
etc.
and I need to store data like - success, probability of choosing, score and so on to every algorithm (every 2 minutes) in collection. So I have chosen Cassandra like my data storage because it works fine with time series. I need to have my data stored to be displayed later in some graphs and charts. Do you think that my solution of data model is ok? I made it in this way:
CREATE TABLE algorithm_by_collection_and_date (
algorithm_id text,
collection_id text,
date text,
event_time timestamp,
score double,
probability double,
PRIMARY KEY ((algorithm_id,collection_id,date),event_time)
);
So it is designed like row partitioning by adding data to the row key to limit the amount of columns (by date) I per algorithm in collection.
What do you think about this? Thanks, Jan
I would have this structure - it will allow you to normalise your structure and make it a cleaner design. I have rushed this, please add in the correct data types for the columns and also the referential integrity constraints.
CREATE TABLE algorithm
(
algorithmId uuid PRIMARY KEY,
algorithmName text
)
CREATE TABLE collection
(
collectionID uuid PRIMARY KEY,
collectionName text
)
CREATE TABLE algo_collection
(
algoCollectionID uuid PRIMARY KEY
collectionID
algorithmID
)
CREATE TABLE recommendation
(
algoCollectionID
date
event_time,
score,
probability
)