Apache Storm vs Apache Samza vs Apache Spark

I have worked on Storm and Spark but Samza is quite new.

I do not understand why Samza was introduced when Storm is already there for real time processing. Spark provides in memory near real time processing and has other very useful components as graphx and mllib.

What are improvements that Samza brings and what further improvements are possible?

Solution

This is a good summary of the differences and pros and cons.

I would just add that Samza, which actually isn't that new, brings a certain simplicity since it is opinionated on the use of Kafka as its backend, while others try to be more generic at the cost of simplicity. Samza is pioneered by the same people who created Kafka, who are also the same people behind the Kappa Architecture--primarily Jay Kreps formerly of LinkedIn. That's pretty cool.

Also, the programming models are totally different between realtime streams with Samza, microbatches in Spark Streaming (which isn't exactly the same as Spark), and spouts and bolts with tuples in Storm.

None of these are "better." It all depends on your use cases, the strengths of your team, how the APIs match up with your mental models, quality of support, etc.

You also forgot Apache Flink and Twitter's Heron, which they made because Storm started to fail them. Then again, very few need to operate at the scale of Twitter.