analytics, 360-degrees, apache-kudu

Any suggestions for an analytical columnar DB that can be modified?


I need to build a customer 360 degree database, which requires:

  • A wide-column table: each customer is one row, with a large number of columns (say, > 1000).
  • About 20 batch analytics jobs run daily. Each job queries and updates a small set of columns across all rows. The work includes aggregating data for reporting and loading/saving data for machine learning algorithms.
  • We update customer info in several columns, for <= 1 million rows per day. The update workload is spread across working hours. The table has more than 200 million rows.

For these requirements, I think a modifiable columnar DB would be a perfect fit: it can be queried and aggregated by column, which is optimal for analytics, and it can absorb several million changes throughout the day. The closest match I have found is Apache Kudu, but its limit of 300 columns is a big turn-off, since we have more than 1000.

We also prefer an open-source project.

Any suggestions?


Solution

  • I will answer my own question, since our solution works fine now.

    Instead of using a single DB for both the analytics and OLTP workloads, we split them in two: the analytics workload is served by Parquet tables in HDFS, and the OLTP workload is served by HBase.

    Of course, we have to duplicate (part of) the customer data, but the extra storage and compute cost is modest, and we are willing to pay it.
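
The split described above can be illustrated with a minimal, in-memory sketch. It does not use the HBase or Parquet APIs: `RowStore`, `export_columns`, and `total` are hypothetical names, a dict keyed by customer id stands in for the HBase side, and a dict of per-column lists stands in for the Parquet side. The sample customers and column names are made up for illustration only.

```python
from collections import defaultdict


class RowStore:
    """Stand-in for the OLTP side (HBase in the answer): point updates by customer id."""

    def __init__(self):
        # customer_id -> {column name -> value}
        self.rows = defaultdict(dict)

    def upsert(self, customer_id, **columns):
        # Only the supplied columns are touched; other columns keep their values.
        self.rows[customer_id].update(columns)


def export_columns(store, column_names):
    """Stand-in for a batch export to columnar files (Parquet in the answer).

    Returns one list per requested column, all aligned on the same
    sorted customer-id order. This copy is the duplicated data.
    """
    ids = sorted(store.rows)
    snapshot = {"customer_id": ids}
    for name in column_names:
        snapshot[name] = [store.rows[i].get(name) for i in ids]
    return snapshot


def total(snapshot, column):
    """An analytics job reads only the column it needs, skipping missing values."""
    return sum(v for v in snapshot[column] if v is not None)


store = RowStore()
store.upsert("c1", city="Hanoi", lifetime_spend=120)
store.upsert("c2", city="Hue", lifetime_spend=80)
store.upsert("c1", lifetime_spend=150)  # daytime point update to one column

snapshot = export_columns(store, ["lifetime_spend"])
print(total(snapshot, "lifetime_spend"))
```

The point of the sketch is the direction of data flow: point updates land in the row-oriented store during the day, and analytics jobs work from a periodically exported column-oriented copy that contains only the columns they need.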