Reading Google's Dataflow API, I have the impression that it is very similar to what Apache Storm does: real-time data processing through a pipelined flow. Unless I completely miss the point here, instead of building bridges for executing pipelines written against each other, I'd expect something different from Google, not a reinvention of the wheel. Apache Storm is already well established and usable from any programming language. What is the real value of doing something like this?
Thank you for your interest in the Dataflow programming model! It is true that both Dataflow and Apache Storm support stream processing, but there are important differences:
Dataflow supports both batch and streaming computation under the same "windowing" API, while Storm, as far as I know, is specifically a streaming system.
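To give a flavor of what that unified API looks like, here is a minimal sketch (class names as I recall them from the Dataflow Java SDK; the input collection `events` is hypothetical). The same two transforms compute per-minute counts whether `events` comes from a bounded (batch) or an unbounded (streaming) source:

```java
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

// `events` is a hypothetical PCollection<String>; it may be bounded
// (batch) or unbounded (streaming) -- the code below is identical.
PCollection<KV<String, Long>> perMinuteCounts = events
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Count.<String>perElement());
```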
The API for defining the topology of the computation is very different in Dataflow and Storm. The Dataflow API largely mimics FlumeJava: you manipulate logical PCollection objects (parallel collections; you can think of them as logical datasets) as you would manipulate real collections, and build new collections by applying parallelizable operations (such as ParDo) to other collections. By contrast, in Apache Storm you build the network of the computation directly from "spouts" and "bolts"; there is no explicit notion of a logical dataset or a parallel operation that I'm aware of.
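To make the contrast concrete, here is a minimal runnable sketch against the Dataflow Java SDK (the GCS path is hypothetical). Note that you never wire spouts to bolts; you derive one logical collection from another and let the system plan the execution:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class WordLengths {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A logical dataset of lines; no physical topology is specified.
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input.txt")); // hypothetical path

    // ParDo applies a parallel function, producing a new logical collection.
    PCollection<Integer> lengths = lines.apply(
        ParDo.of(new DoFn<String, Integer>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().length());
          }
        }));

    p.run();
  }
}
```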
The logical representation of a pipeline in Dataflow allows the framework to perform optimizations similar to those done by query optimizers in database systems, e.g. avoiding or introducing materialization of certain intermediate results, moving or eliminating group-by-key operations, etc. You can see an overview of these optimizations in the FlumeJava paper. This is useful in both batch and streaming modes.
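As a hypothetical example of what such an optimization buys you: in the fragment below (continuing the sketch above), the two ParDo steps can be fused into a single stage, so the intermediate collection `trimmed` need never be materialized:

```java
// Continuing from `lines` above. Because `trimmed` is a logical
// collection, the optimizer is free to fuse both ParDos into one
// stage and never write the intermediate results anywhere.
PCollection<String> trimmed = lines.apply(
    ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().trim());
      }
    }));

PCollection<Integer> trimmedLengths = trimmed.apply(
    ParDo.of(new DoFn<String, Integer>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().length());
      }
    }));
```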
Dataflow's and Storm's streaming computation models offer different consistency guarantees. This is actually a fascinating topic! I suggest reading the MillWheel paper (which is what the streaming part of Dataflow is based on) for an overview of the fault tolerance and consistency concerns in a streaming system; I believe the paper briefly compares MillWheel with Storm too. You can find a more extensive discussion of the importance of consistency guarantees in streaming systems, and of the strength of the consistency provided by Dataflow, in the talk Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda Architecture.
One of the main value propositions of Dataflow as part of the Google Cloud Platform is that it is zero-hassle: you do not need to set up a cluster or a monitoring system; you simply submit your pipeline to the cloud API, and the service allocates resources for it, executes your pipeline on them, and monitors it for you. This is perhaps not related to your question about the similarity of the programming models, though.
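To illustrate the zero-hassle point, submitting the pipeline above to the managed service is roughly a matter of setting a few options (the project id and bucket below are hypothetical, and the class names are as I recall them from the Dataflow Java SDK):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;

DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);  // run on the managed service
options.setProject("my-gcp-project");                     // hypothetical project id
options.setStagingLocation("gs://my-bucket/staging");     // hypothetical bucket

Pipeline p = Pipeline.create(options);
// ... build the pipeline as in the sketches above ...
p.run();  // the service allocates workers, runs the job, and monitors it
```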