apache-storm apache-samza

Where do Apache Samza and Apache Storm differ in their use cases?


I've stumbled upon this article that purports to contrast Samza with Storm, but it seems to address only implementation details.

Where do these two distributed computation engines differ in their use cases? What kind of job is each tool good for?


Solution

  • The biggest difference between Apache Storm and Apache Samza comes down to how each one streams and processes data.

    Apache Storm performs real-time computation using a topology, which is submitted to a cluster where the master node distributes the code among worker nodes that execute it. Within a topology, spouts emit data streams as immutable sets of key-value pairs (tuples), and bolts consume and transform those streams.

    Here's Apache Storm's architecture: [image: Apache Storm architecture diagram]
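    To make the topology idea concrete, here is a minimal sketch in plain Python. It does not use Storm's API; it only imitates the shape of a topology, where a spout emits immutable key-value tuples and bolts (Storm's processing steps) transform them. All function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical, and the whole thing runs in one process rather than across worker nodes.

```python
from collections import Counter
from types import MappingProxyType


def sentence_spout(sentences):
    """Toy spout: emit each sentence as a read-only key-value tuple."""
    for sentence in sentences:
        yield MappingProxyType({"sentence": sentence})


def split_bolt(stream):
    """Toy bolt: turn each sentence tuple into one tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield MappingProxyType({"word": word.lower()})


def count_bolt(stream):
    """Toy bolt: keep a running count per word and emit the updated count."""
    counts = Counter()
    for tup in stream:
        word = tup["word"]
        counts[word] += 1
        yield MappingProxyType({"word": word, "count": counts[word]})


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and collect the latest counts."""
    stream = count_bolt(split_bolt(sentence_spout(sentences)))
    latest = {}
    for tup in stream:
        latest[tup["word"]] = tup["count"]
    return latest


word_counts = run_topology(["storm processes streams", "samza processes messages"])
print(word_counts)
```

    In a real Storm cluster the spout and each bolt would run as parallel tasks on different workers; the generator chain above only illustrates the data flow between them.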

    Apache Samza processes messages one at a time as they arrive. Each stream is divided into partitions, and each partition is an ordered sequence of messages in which every message has a unique ID (its offset). Samza also supports batching and is typically used with Hadoop's YARN and Apache Kafka.

    Here's Apache Samza's architecture: [image: Apache Samza architecture diagram]
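    The partitioned-stream model can be sketched in plain Python as well. This is not Samza's API, only a toy model under stated assumptions: the partition count, the CRC32-based key hashing, and the helper names (`partition_for`, `append`, `process_partition`) are all hypothetical. It shows messages being appended to ordered partitions, each with its own ID within the partition, and then being processed one at a time in order.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3  # hypothetical partition count chosen for the example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition from the message key (assumed hashing scheme)."""
    return crc32(key.encode("utf-8")) % num_partitions


def append(partitions, key, value):
    """Append a message; its ID is its position within the partition."""
    p = partition_for(key)
    offset = len(partitions[p])
    partitions[p].append((offset, key, value))
    return p, offset


def process_partition(messages):
    """Toy task: handle one message at a time, in partition order."""
    totals = defaultdict(int)
    for offset, key, value in messages:
        totals[key] += value
    return dict(totals)


partitions = defaultdict(list)
events = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
for key, value in events:
    append(partitions, key, value)

# One toy task per partition; each sees its messages in order.
results = {p: process_partition(msgs) for p, msgs in partitions.items()}
print(results)
```

    Because every message with the same key lands in the same partition, each toy task can keep per-key state without coordinating with the others.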

    The resources listed below go into more detail on how each system executes its jobs.

    USE CASE

    Apache Samza was created by LinkedIn.

    A software engineer wrote a post stating:

    It's been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers. Our largest Samza job is processing over 1,000,000 messages per-second during peak traffic hours.

    Resources Used:

    Storm vs. Samza Comparison

    Useful Architectural References of Storm and Samza