Search code examples
cassandracql3datastax-java-driver

Cassandra - join two tables and save result to new table


I am working on a self-bi application where users can upload their own datasets which are stored in Cassandra tables that are created dynamically. The data is extracted from files that the user can upload. So, each dataset is written into its own Cassandra table modeled based on column headers in the uploaded file while indexing the dimensions.

Once the data is uploaded, the users are allowed to build reports, analyze, etc., from within the application. I need a way to allow users to merge/join data from two or more datasets/tables based on matching keys and write the result into a new Cassandra table. Once a dataset/table is created, it will stay immutable and data is only read from it.

user table 1

  1. username
  2. email
  3. employee id

user table 2

  1. employee id
  2. manager

I need to merge data in user table 1 and user table 2 on matching employee id and write to new table that is created dynamically.

new table

  1. username
  2. email
  3. employee id
  4. manager

What would be the best way to do this?


Solution

  • The only option that you have is to do the join in your application code. There are just few details to suggest a proper solution.

    Please add details about table keys, usage patterns... in general, in cassandra you model from usage point of view, i.e. starting with queries that you'll execute on data.

    In order to merge 2 tables on this pattern, you have to do it into application, creating the third table (target table ) and fill it with data from both tables. You have to make sure that you read the data in pages to not OOM, it really depends on size of the data.

    Another alternative is to build the joins into Spark, but maybe is too over-engineering in your case.