I have the following data pipeline:
- A process writes messages to Kafka
- A Spark Structured Streaming application listens for new Kafka messages and writes them as-is to HDFS (see the sketch right after this list)
- A batch Hive job runs on an hourly basis, reads the newly ingested messages from HDFS, and populates some tables via moderately complex INSERT INTO statements (I do not have materialized views available). EDIT: Essentially, after my Hive job I end up with Table1 storing the raw data, then Table2 = fun1(Table1), then Table3 = fun2(Table2), then Table4 = join(Table2, Table3), etc. Each fun is a selection or an aggregation (a toy version of this chain is sketched further below).
- A Tableau dashboard visualizes the data I wrote.
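For reference, step 2 currently looks roughly like this (a simplified sketch; the broker address, topic name, and HDFS paths are placeholders):

```scala
// Simplified sketch of step 2: read Kafka messages and write them as-is to HDFS.
// Broker address, topic name, and HDFS paths are placeholders.
import org.apache.spark.sql.SparkSession

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hdfs")
      .getOrCreate()

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    messages.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events/raw")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```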
As you can see, step 3 prevents my pipeline from being real time.
What can you suggest to make my pipeline fully real time? EDIT: I'd like to have Table1, ..., TableN updated in real time!
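To make the shape of the hourly job concrete, here is a toy version of the derivation chain. The real job is plain HiveQL run by Hive; it is written via spark.sql here only to keep all the sketches in one language, and the table names, columns, and SELECTs are made up:

```scala
// Toy version of the hourly derivation chain. The real job is plain HiveQL run by Hive;
// spark.sql is used here only to keep all sketches in one language.
// Table names, columns, and the concrete SELECTs are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hourly-derivations")
  .enableHiveSupport()
  .getOrCreate()

// Table2 = fun1(Table1): a selection over the raw data
spark.sql("""INSERT INTO table2
             SELECT id, event_type, amount FROM table1 WHERE event_type = 'purchase'""")

// Table3 = fun2(Table2): an aggregation
spark.sql("""INSERT INTO table3
             SELECT event_type, SUM(amount) AS total_amount FROM table2 GROUP BY event_type""")

// Table4 = join(Table2, Table3)
spark.sql("""INSERT INTO table4
             SELECT t2.id, t2.event_type, t3.total_amount
             FROM table2 t2 JOIN table3 t3 ON t2.event_type = t3.event_type""")
```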
- Using Hive with Spark Streaming is not recommended, because the whole point of Spark Streaming is low latency, while Hive is an OLAP tool that introduces high latency: under the hood it runs MR/Tez jobs (depending on hive.execution.engine).
Recommendation: Use Spark Streaming with a low-latency store such as HBase or Phoenix.
Solution: Develop a Spark Streaming job with Kafka as the source and a custom sink that writes the data into HBase/Phoenix, along the lines of the sketch below.
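A minimal sketch of that solution, assuming HBase as the target and the raw key/value pair stored as-is. The broker, topic, ZooKeeper quorum, table name, and column family are placeholders, and non-null Kafka keys are assumed so they can serve as HBase row keys:

```scala
// Sketch: Kafka -> Spark Structured Streaming -> HBase, writing each micro-batch
// through foreachBatch with the plain HBase client.
// Broker, topic, ZooKeeper quorum, table name, and column family are placeholders.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hbase")
      .getOrCreate()

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .where("key IS NOT NULL")

    messages.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // One HBase connection per partition; each row becomes a Put.
        batch.rdd.foreachPartition { rows =>
          val conf = HBaseConfiguration.create()
          conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")
          val connection = ConnectionFactory.createConnection(conf)
          val table = connection.getTable(TableName.valueOf("events"))
          rows.foreach { row =>
            val put = new Put(Bytes.toBytes(row.getAs[String]("key")))
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"),
              Bytes.toBytes(row.getAs[String]("value")))
            table.put(put)
          }
          table.close()
          connection.close()
        }
      }
      .option("checkpointLocation", "hdfs:///checkpoints/events-hbase")
      .start()
      .awaitTermination()
  }
}
```

If Phoenix is preferred over the raw HBase client, the same foreachBatch hook can issue UPSERT statements through the Phoenix JDBC driver instead.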