cassandra kylin

Cassandra and aggregated data


We have a "legacy" SQL Server-based application which keeps OLTP data (sales):

  • The OLTP data structure is very complex
  • We still must keep it as the source for reports
  • Reports over the OLTP structures are very slow
  • So we prepare and maintain up-to-date "OLAP" views (say, sales per day); each view is actually a table in the MS SQL database

The main problem: when we need a new view, scanning all the existing OLTP data takes a long time.
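To make the cost concrete, here is a minimal Python sketch of the situation, not our actual code. The rows, the field names (`sold_at`, `store`, `amount`) and the helper `build_view` are all hypothetical. It shows that each view is one full pass over the OLTP rows, so every new view means another full pass.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# Hypothetical OLTP rows; the field names are assumptions for illustration.
SALES = [
    {"sold_at": datetime(2017, 3, 1, 9, 30), "store": "A", "amount": Decimal("10.00")},
    {"sold_at": datetime(2017, 3, 1, 17, 5), "store": "B", "amount": Decimal("15.50")},
    {"sold_at": datetime(2017, 3, 2, 11, 0), "store": "A", "amount": Decimal("7.25")},
]


def build_view(rows, key_fn):
    """Full scan: fold every row into a running total per key."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        totals[key_fn(row)] += row["amount"]
    return dict(totals)


# The existing view: sales per day (one full pass over SALES).
sales_per_day = build_view(SALES, lambda r: r["sold_at"].date())

# A new view, e.g. sales per store, needs another full pass.
sales_per_store = build_view(SALES, lambda r: r["store"])
```

With three rows the second pass is free; with years of OLTP history it is the slow step described above.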

Now we want to migrate to Cassandra. Should we use the same approach to achieve the same goals, or:

  • Would tools like Spark or Kylin be a better fit? Can they do things like this?
  • Could the approach be changed somehow?

Solution

  • This might not be the answer you are looking for, but I want to share our experience with Cassandra and aggregated data. In our project, we collect data from servers across the world and aggregate it. Some of the metrics are messages per hour per server, per geographic region, and so on. When a new piece of data comes in, it either automatically kicks off a batch process that performs the aggregation, or it is inserted into multiple tables/views. We use Apache Spark as the processing engine, and, depending on the use case, we also use Cassandra features such as materialized views, secondary indexes, and custom triggers. One important point when designing the data model is to forget about normal forms (NF); in general, we don't need them in NoSQL.

    In short, migrating from a traditional database to a NoSQL database might be troublesome at first, but the end result is quite satisfactory in terms of performance and availability.
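The "insert data into multiple tables/views" idea can be sketched roughly as follows. This is an in-memory Python illustration under stated assumptions, not the project's real code: the two `Counter` objects stand in for denormalized aggregate tables, and the names `record_message`, `messages_per_hour_per_server` and `messages_per_hour_per_region` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# In-memory stand-ins for two denormalized aggregate tables (assumption:
# one table per query pattern, keyed by the columns the report filters on).
messages_per_hour_per_server = Counter()
messages_per_hour_per_region = Counter()


def record_message(server, region, received_at):
    """On each incoming message, update every aggregate it belongs to."""
    # Truncate the timestamp to the start of its hour to get the bucket.
    hour = received_at.replace(minute=0, second=0, microsecond=0)
    messages_per_hour_per_server[(server, hour)] += 1
    messages_per_hour_per_region[(region, hour)] += 1


def utc(h, m):
    """Helper for the sample data below: a UTC time on 2017-03-01."""
    return datetime(2017, 3, 1, h, m, tzinfo=timezone.utc)


# Sample traffic: three messages in the 10:00 hour, one in the 11:00 hour.
record_message("srv-1", "eu", utc(10, 5))
record_message("srv-1", "eu", utc(10, 40))
record_message("srv-2", "eu", utc(10, 59))
record_message("srv-1", "eu", utc(11, 0))
```

Because the aggregates are updated at write time, a report reads a single key instead of scanning raw data. The trade-off is the one mentioned above: the same fact is stored several times, which goes against normal forms, and a view added later still has to be backfilled once, for example with a Spark batch job.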