
Stateful Kafka Stream - How to restore state?


If I am running my stream application, appA, on Machine A and then move it to Machine B, will it remember the earlier state?

When I write a simple consumer, it remembers the last offset, which is stored in __consumer_offsets on the broker. So no matter where I start the consumer, it picks up from that position.

Is there such a construct for stateful stream processing applications? If I am calculating the continuous Profit and Loss of my portfolio, I need to start from where the last run ended and then aggregate new transactions onto that earlier P&L number. I cannot afford to process all messages again from the beginning of time. I have had a hard time finding an article that explains how to solve this problem.


Solution

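  • No, the local state does not follow the application: the state store lives on disk under the directory set by state.dir, so it stays on Machine A unless you move that directory as well. The input position is not lost, though. Kafka Streams commits its offsets to __consumer_offsets under the application.id, just as a plain consumer does under its group ID.

    Without the local store, Machine B has to rebuild the state by reading the store's changelog topic from the earliest offset. By default that replays the changelog (which is log-compacted for key-value stores), not the original input topic, so the aggregate is restored without reprocessing every transaction.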

    There are presentations about running Kafka Streams on Kubernetes that cover some aspects of this, since Kubernetes can stop and relocate its pods. However, Kubernetes also has volume management features that may not be available in your scenario.

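    It might therefore be best to run your application on both machines from the start. That gives you fault tolerance and high availability, with the state partitioned across the instances and, if num.standby.replicas is set to 1 or more, a warm standby replica of each store.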
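As a rough illustration of the settings mentioned in this answer, here is a minimal sketch of the configuration both instances might share. It uses the plain string keys so that it compiles without the Kafka Streams dependency. The application ID `pnl-aggregator`, the broker address, and the state directory path are assumptions made up for the example, not values from the question.

```java
import java.util.Properties;

public class StreamsConfigSketch {

    // Builds the configuration discussed above using plain string keys.
    // All concrete values are hypothetical placeholders.
    static Properties build(String stateDir) {
        Properties props = new Properties();
        // Must be identical on every machine: it names the consumer group
        // (committed input offsets) and prefixes the changelog topics.
        props.setProperty("application.id", "pnl-aggregator");
        // Placeholder broker address.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Local directory holding the state stores; if it is empty on a new
        // machine, the stores are rebuilt from the changelog topics.
        props.setProperty("state.dir", stateDir);
        // Keep one warm copy of each store on another running instance.
        props.setProperty("num.standby.replicas", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("/var/lib/kafka-streams");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application these properties would be passed to `new KafkaStreams(topology, props)`. A standby replica only helps when at least two instances with the same application.id are running, which is why starting the job on both machines goes together with this setting.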