I've integrated kafka and spark streaming after downloading from the apache website. However, I wanted to use Datastax for my Big Data solution and I saw you can easily integrate Cassandra and Spark.
But I can't see any kafka modules in the latest version of Datastax enterprise. How to integrate kafka with spark streaming here?
What I want to do is basically:
However after a quick google search, I can't see anywhere that kafka has been incorporated with datastax enterprise.
How can I achieve this? I'm really new to datastax and kafka and all so I need some advice. Language preference- Python. Thanks!
Good question. DSE does not incorporate Kafka out of the box, you must set up kafka yourself and then set up your spark streaming job to read from kafka. Since DSE does bundle spark, use DSE Spark to run your spark streaming job.
You can use either the direct kafka API or kafka receivers, more details here on the tradeoffs. TL;DR direct api does not require WAL or zookeeper for HA.
Here is an example of how you can configure Kafka to work with DSE by Cary Bourgeois:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master