For research purposes I'm studying an architecture for real-time (and also offline) data analytics and semantic annotation. I've attached a basic schema: I have some sensors linked to a Raspberry Pi 3. I suppose I can handle this link with an MQTT broker like Mosquitto. However, I want to collect data on the Raspberry Pi, do some processing, and forward it to a cluster of commodity hardware to perform real-time reasoning with Spark or Storm (any hint about which?). These data then have to be stored in a NoSQL database (probably Cassandra or HBase) accessible to a Hadoop cluster, which executes batch reasoning and semantic data enrichment on them and re-stores the results in the same database. Clients can then query the system to extract useful information.
Which technology should I use in the red block? My idea is MQTT, but maybe Kafka could fit my purposes better?
Spark vs Storm
Spark is the clear winner right now between Spark and Storm. At least one reason is that Spark is much more capable of handling large data volumes in a performant way, while Storm struggles with processing large volumes of data at high velocity. For the most part the big data community has embraced Spark, at least for now. Other technologies like Apex and Kafka Streams are also making waves in the stream processing space.
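For illustration, here is a minimal sketch of how the real-time side could look with Spark Structured Streaming reading sensor data out of Kafka. It assumes the spark-sql-kafka integration is on the classpath; the broker address and the topic name "sensor-readings" are placeholders for your setup.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class SensorStream {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("SensorStream")
        .getOrCreate();

    // Read the sensor topic from Kafka as a streaming DataFrame.
    // Broker address and topic name are placeholders.
    Dataset<Row> readings = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka-broker:9092")
        .option("subscribe", "sensor-readings")
        .load()
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

    // For a proof of concept, just print micro-batches to the console;
    // in the real pipeline this sink would be Cassandra or HBase.
    StreamingQuery query = readings.writeStream()
        .outputMode("append")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
```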
Producing to Kafka from the Raspberry Pi
If you choose the Kafka path, keep in mind that the Java client for Kafka is by far, in my experience, the most reliable implementation. However, I would do a proof of concept to ensure that there won't be any memory issues, since the Raspberry Pi doesn't have a lot of RAM. A minimal producer sketch for that proof of concept is shown below.
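This is only a starting point, assuming string-encoded readings; the broker address, topic name, and reduced buffer size are assumptions you would tune for your deployment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PiSensorProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address and topic name are placeholders for your cluster.
    props.put("bootstrap.servers", "kafka-broker:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // Shrink the client's send buffer (default 32 MB) so it stays within the Pi's RAM.
    props.put("buffer.memory", "8388608"); // 8 MB

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // In a real deployment this value would come from the attached sensors.
      String reading = "{\"sensor\":\"temp-1\",\"value\":21.4}";
      producer.send(new ProducerRecord<>("sensor-readings", "temp-1", reading));
      producer.flush();
    }
  }
}
```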
Kafka At the Heart
Keeping Kafka in your RED box will give you a very flexible architecture moving forward, because any process (Storm, Spark, Apex, Kafka Streams, a plain Kafka consumer) can connect to Kafka and quickly read the data. Having Kafka at the heart of your architecture gives you a "distribution" point for all your data: it is very fast, and it also allows data to be retained there permanently. Keep in mind that you can't query Kafka, so using it means simply reading the messages as fast as you can to populate other datastores or to perform streaming calculations, as in the sketch below.
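As an illustration of that read-as-fast-as-you-can pattern, a plain Kafka consumer that drains the topic and hands each record to a datastore writer might look like this; the broker, group id, and topic name are placeholders, and the Cassandra/HBase insert is left as a comment.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorToStoreConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Placeholders: adjust broker, group id, and topic to your deployment.
    props.put("bootstrap.servers", "kafka-broker:9092");
    props.put("group.id", "datastore-loader");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("sensor-readings"));
      while (true) {
        // Drain whatever is available and pass it on.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // Here you would issue the insert into Cassandra/HBase.
          System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
        }
      }
    }
  }
}
```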