
Stateful and stateless stream processing


While starting to learn stream processing, I keep coming across two terms: stateful stream processing and stateless stream processing. What is the difference between them? I have heard that Storm is stateless while Storm Trident is stateful, so in practice, where should I use Storm and where should I use Storm Trident?


Solution

  • At a very high level, the difference between the two lies in the kind of operation you have to perform on the stream.

    Some operations are stateless, that is, you process one record at a time. Think of a bank teller who processes a stream of customers, one at a time. Each customer is a new unit of work that does not depend on the previous one.

    A stateful operation is like hiring new employees. You have a stream of people coming in for interviews, but whether you hire each one depends on your state, that is, on which positions you still have open.

    For example, let's say you're processing web logs. If you want to know how many users are looking at each page per second, your processing is almost stateless: every second you count how many users came to each page, and once that second is over you no longer care about its result. Nothing carries over from one second to the next, so this is effectively a stateless operation.

    Now let's say that you instead want to forecast how many users you'll have in the next second. You want to average the last 10 minutes, so you need to keep a queue with the last 10 * 60 = 600 per-second counts. That queue is the state your processing needs, and you have to update it every second so that it always holds the most recent 10 minutes. That is, of course, a stateful operation. A simpler stateful operation is just counting the total number of page views since the site went live.

    One critical difference between the two is that if the stream stops and you have to reset the system, a stateful operation requires you to save its state and restore it afterwards. A stateless operation has no state to save, so it is generally simpler.
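
The web-log examples above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Storm or Trident code: the names `count_views_per_page`, `ViewForecaster`, `snapshot` and `restore` are made up for this sketch, and the "forecast" is assumed to be a simple mean of the per-second totals in the window.

```python
import json
from collections import deque


def count_views_per_page(events_in_one_second):
    """Stateless: the result depends only on the batch passed in."""
    counts = {}
    for page in events_in_one_second:
        counts[page] = counts.get(page, 0) + 1
    return counts


class ViewForecaster:
    """Stateful: keeps the last `window_seconds` per-second totals
    plus a running total of all page views (hypothetical helper)."""

    def __init__(self, window_seconds=600):
        # deque with maxlen drops the oldest second automatically
        self.window = deque(maxlen=window_seconds)
        self.total_views = 0

    def update(self, views_this_second):
        # Called once per second; both pieces of state change here.
        self.window.append(views_this_second)
        self.total_views += views_this_second

    def forecast_next_second(self):
        # Assumed forecast: mean of the seconds currently in the window.
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def snapshot(self):
        # Serialize the state so it can survive a restart.
        return json.dumps({
            "window_seconds": self.window.maxlen,
            "window": list(self.window),
            "total_views": self.total_views,
        })

    @classmethod
    def restore(cls, data):
        # Rebuild the operator from a previously saved snapshot.
        saved = json.loads(data)
        forecaster = cls(saved["window_seconds"])
        forecaster.window.extend(saved["window"])
        forecaster.total_views = saved["total_views"]
        return forecaster
```

The stateless function can be thrown away and re-created at any moment without losing anything. The stateful class only gives correct answers after a restart if something calls `snapshot()` regularly and feeds the result back into `restore()`, which is the extra bookkeeping the answer refers to.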