Search code examples
apache-kafkaapache-kafka-streams

Purpose of statestore and changelog topic in kafka streams?


I have a kafka stream application in which it is using stateStore (backed by RocksDB).

All what stream thread is doing is getting data from kafka topic and putting the data into state-store. (There is other thread which read data from statestore and does business logic processing).

I observed it creates a new kafka topic "changelog" because of stateStore.

But I didn't get what purpose "changelog" kafka topic serves?

  • Why is it (changelog) needed?
  • What is relationship between statestore and "changelog" kafka topics?
  • Who puts data into this topic? ("changelog")

Solution

  • When you enable change logging for a state store, Kafka Streams captures changes to the state and writes them to a changelog topic in Kafka. This changelog topic acts as a durable and fault-tolerant storage for the state, allowing the state to be restored in case of application restarts or failures.

    Lets take word count example.

    Initial State:

    • Word: "hello", Count: 1
    • Word: "world", Count: 1

    Change Log Entries:

    When a word is processed multiple times, the state store updates the count for that word, and these updates are written to the changelog topic.

    • Update for "hello":
      • Word: "hello", Count: 2
    • Update for "world":
      • Word: "world", Count: 2
    • Update for "hello" again:
      • Word: "hello", Count: 3

    Changelog Topic:

    The changelog topic for the word-count-store might contain records like the following

    • Key: "hello", Value: 1 (Initial state)
    • Key: "world", Value: 1 (Initial state)
    • Key: "hello", Value: 2 (Update)
    • Key: "world", Value: 2 (Update)
    • Key: "hello", Value: 3 (Update)

    Restore State:

    If the Kafka Streams application restarts or fails over to another instance, it can restore the state of the word-count-store by replaying the changelog topic from the beginning. This ensures that the state is consistent and up-to-date across application instances.

    Compact Topics:

    To optimize storage and reduce the volume of change log data, it can be configured to use log compaction. This ensures that only the latest update for each key is retained in the changelog topic, allowing the state to be fully restored while minimizing storage requirements.