Tags: apache-kafka, apache-kafka-streams, stateful

How to clean up Kafka with Kafka Streams on a rolling basis?


Data accumulates in both Kafka topics and Kafka Streams state stores over time, which causes storage costs to grow. Is it safe to set a retention period on Kafka topics, for example a week, when data is moved between topics with Kafka Streams? My concern is whether the deletion of older data could impact data consistency and recoverability with Kafka Streams.

Also, how can older data be safely purged from Kafka Streams state stores?


Solution

  • In Kafka Streams, state stores are backed by internal topics with a -changelog suffix. When a fresh application instance starts, its store is restored from this topic. It is therefore important to set the retention time on changelog topics to -1 and clean them up yourself (see the sketch below).
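
    As a minimal sketch of that advice (the store name my-store, the source topic input-topic, and the serdes are all illustrative assumptions), Materialized#withLoggingEnabled lets you override the changelog topic's configuration from the topology itself:

    ```java
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class ChangelogRetentionExample {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Changelog topic overrides: compact only, no time-based deletion.
            Map<String, String> changelogConfig = new HashMap<>();
            changelogConfig.put("cleanup.policy", "compact");
            changelogConfig.put("retention.ms", "-1");

            builder.table(
                "input-topic", // hypothetical source topic
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("my-store")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(Serdes.String())
                    .withLoggingEnabled(changelogConfig));

            // ... build and start the KafkaStreams instance as usual.
        }
    }
    ```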

    For standard topics you can rely on retention time for cleanup. However, keep one important thing in mind: retention time is the minimum lifetime of a record, not the maximum. With a one-day retention, for example, a record may be removed after a day, but it may also linger considerably longer. Also keep in mind that when you clean up topics by publishing tombstones, those tombstones themselves persist on the topic much longer than you might expect.
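
    If you want to set the retention on a regular topic programmatically rather than through the kafka-configs CLI, a sketch using the AdminClient could look like this (the topic name my-topic and the broker address are assumptions):

    ```java
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class TopicRetentionExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // Roughly one week, in milliseconds.
                AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(setRetention)))
                    .all().get();
            }
        }
    }
    ```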

    The general approach to cleaning up your state store is to use the punctuation API.

    The idea is to register a punctuator that periodically iterates over the store and deletes any records you consider worthy of cleanup, as sketched below.
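
    Here is a minimal sketch of that approach, assuming a hypothetical store named my-store of type KeyValueStore<String, Long> whose values hold each key's last-update timestamp, and an illustrative TTL of one week:

    ```java
    import java.time.Duration;

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class TtlProcessor implements Processor<String, Long, String, Long> {
        private static final long TTL_MS = Duration.ofDays(7).toMillis(); // illustrative TTL

        private KeyValueStore<String, Long> store;

        @Override
        public void init(ProcessorContext<String, Long> context) {
            store = context.getStateStore("my-store"); // hypothetical store name
            // Run the cleanup once per hour of wall-clock time.
            context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, now -> {
                try (KeyValueIterator<String, Long> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, Long> entry = it.next();
                        if (entry.value != null && now - entry.value > TTL_MS) {
                            store.delete(entry.key); // also writes a tombstone to the changelog
                        }
                    }
                }
            });
        }

        @Override
        public void process(Record<String, Long> record) {
            // Track the latest update time for each key.
            store.put(record.key(), record.timestamp());
        }
    }
    ```

    The processor still has to be connected to my-store when you build the topology (for example via Topology#addStateStore). Because store.delete() writes a tombstone to the changelog topic, the cleanup also shrinks the restore data over time once compaction runs.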

    An alternative approach is to use a TimestampedKeyValueStore, which stores each value together with the timestamp of its latest modification.
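
    A compact sketch of the same punctuator built on a timestamped store (the store name my-ts-store and the TTL are again illustrative); here the staleness check uses the timestamp the store maintains itself, so nothing needs to be embedded in the value:

    ```java
    import java.time.Duration;

    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.TimestampedKeyValueStore;
    import org.apache.kafka.streams.state.ValueAndTimestamp;

    public class TimestampedTtlProcessor implements Processor<String, String, String, String> {
        private static final long TTL_MS = Duration.ofDays(7).toMillis(); // illustrative TTL

        private TimestampedKeyValueStore<String, String> store;

        @Override
        public void init(ProcessorContext<String, String> context) {
            store = context.getStateStore("my-ts-store"); // hypothetical store name
            context.schedule(Duration.ofHours(1), PunctuationType.WALL_CLOCK_TIME, now -> {
                try (KeyValueIterator<String, ValueAndTimestamp<String>> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, ValueAndTimestamp<String>> entry = it.next();
                        // The store tracks the modification timestamp for us.
                        if (entry.value != null && now - entry.value.timestamp() > TTL_MS) {
                            store.delete(entry.key);
                        }
                    }
                }
            });
        }

        @Override
        public void process(Record<String, String> record) {
            // The record's timestamp is stored alongside the value.
            store.put(record.key(), ValueAndTimestamp.make(record.value(), record.timestamp()));
        }
    }
    ```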