apache-spark, apache-kafka, apache-superset

Understanding Kappa architecture with Apache Superset


There is a lot of information about Kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.

Let's imagine you want to implement a Kappa architecture involving the following tech stack:

  • Apache Kafka
  • Apache Spark
  • Apache Superset

Now imagine the application you want to do data analytics against has a PostgreSQL database. Of course you can easily connect Apache Superset directly to that PostgreSQL database and create charts.

But now you want to see how you would do this with a Kappa architecture, so you add Kafka and Spark. You can emit events to Kafka and read those events in Apache Spark. Kafka will retain messages for topics for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark, the docs say Hive should be used as a connector (the project website also states the tool is unsupported, and if you look at this issue on PyHive you find that Impyla could be an alternative). But Apache Hive is a completely different project for a storage system. So how would this connection work? Assume you have Kafka nodes running (with ZooKeeper, obviously) and Spark running, and then you connect Apache Superset through this Hive connector to Spark.
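
To make the setup concrete, emitting events from the application side is straightforward with the plain Kafka producer API. This is just a sketch; the topic name, broker address and payload are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventEmitter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // A JSON payload for an imaginary "order created" event
    val event = """{"orderId": 42, "amount": 19.99, "createdAt": "2023-01-01T12:00:00Z"}"""
    producer.send(new ProducerRecord[String, String]("app-events", "42", event))
    producer.close()
  }
}
```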

How can you write queries against the data that is in Kafka (which is in fact the live data)?

On the Spark side itself you can easily write a Scala program that reads data from Kafka and does something with it (for example, something like the snippet below), but how can you achieve this from Apache Superset?
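
This part works fine for me; broker and topic are placeholders again, and the stream is just dumped to the console:

```scala
import org.apache.spark.sql.SparkSession

object KafkaReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-reader").getOrCreate()

    // Read the topic as a stream and print each micro-batch to the console.
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "app-events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```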

Or is this not the intended way of connecting these things?


Solution

  • If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table into the Hive metastore, which could then be queried from Superset via the Spark Thrift Server (see the first sketch below).

    Hive itself doesn't store any of the data. Hive also has a built-in Kafka storage handler, so Spark isn't strictly necessary.

    But, Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset (see the second sketch below).

    Or you can remove Spark and use Kafka Connect to write to anything else a dashboarding tool (Tableau is another popular one) can support - a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.
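
To make the first option concrete, here is a rough sketch of what the Spark job could look like. It is only an illustration: it assumes Spark 3.1+ (for `toTable`), Hive support on the classpath, and a Spark Thrift Server started against the same Hive metastore; the topic, schema and paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructType}

object KafkaToHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hive-table")
      .enableHiveSupport() // share the Hive metastore with the Thrift Server
      .getOrCreate()

    // Placeholder schema for the JSON events sitting in the topic
    val schema = new StructType()
      .add("orderId", LongType)
      .add("amount", DoubleType)
      .add("createdAt", StringType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "app-events")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .select("e.*")

    // Continuously append to a table registered in the Hive metastore,
    // which the Thrift Server can then serve to Superset.
    events.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/orders_live")
      .toTable("orders_live")
      .awaitTermination()
  }
}
```

Superset would then register the Thrift Server as a database (e.g. a `hive://<host>:10000/default` SQLAlchemy URI) and query `orders_live` like any other table; each dashboard refresh sees whatever micro-batches have been appended so far.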
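
The Presto variant from the second option mainly changes the sink: instead of a metastore table, you continuously write files to object storage and define an external table over that location for Presto to query. Again just a sketch; the bucket and paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "app-events")                    // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Continuously append Parquet files; Presto/Trino queries an external
    // table defined over this S3 prefix (bucket name is a placeholder).
    events.writeStream
      .format("parquet")
      .option("path", "s3a://analytics-bucket/events/")
      .option("checkpointLocation", "s3a://analytics-bucket/checkpoints/events/")
      .start()
      .awaitTermination()
  }
}
```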