
Design: send duplicate states into Kafka topic


I'm working on a side project where I ingest transportation data into a Kafka cluster. The data comes from my city's public API — for example, the list of road works in the city.

I fetch the road works every few hours, but the public API returns no timestamp, so I have no easy way to tell which road works are new or were modified recently. Most of the time the content returned by the API has not changed since the last fetch. I use the roadwork id as the message key and I enabled log compaction, so having a lot of duplicates does not scare me: I'm sure the last state of each roadwork will be kept.

But given the high number of duplicates, and the fact that I'm only interested in the latest version, is this OK? Should I try to detect the new/modified road works and only push those? Is there a way to do this directly in Kafka?


Solution

  • Kafka's log compaction is a very good fit for your use case. The alternatives would mean writing code on your own and adding extra complexity.

    As you have already noted, when enabling log compaction it is important to remember that at least the last state of each key (roadwork) is kept in the topic — compaction is a guarantee on the minimum retained, not a deduplication mechanism, so you will still find duplicates.
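Because duplicates can survive compaction, consumers should treat each record for a key as an upsert so that replaying the topic converges to the latest state. A minimal sketch of that idea (class and field names are my own, not from any Kafka API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: materialize the latest state per roadwork id. Replaying a
// compacted topic (which may still contain duplicate records per key)
// converges to the same final map, because later records overwrite
// earlier ones for the same key.
public class RoadworkState {
    private final Map<String, String> latestByRoadworkId = new HashMap<>();

    // Apply one consumed record; an upsert per key.
    public void apply(String roadworkId, String payload) {
        latestByRoadworkId.put(roadworkId, payload);
    }

    public String get(String roadworkId) {
        return latestByRoadworkId.get(roadworkId);
    }
}
```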

    To minimise duplicates and keep the overall volume low, you can tweak the available topic configurations. Most notably, I suggest that you:

    • decrease min.cleanable.dirty.ratio (which defaults to 0.5) to trigger more frequent cleanings. Keep in mind, however, that this makes each cleaning less efficient while using more resources.

    • reduce max.compaction.lag.ms (which defaults to Long.MAX_VALUE) to cap the time a message can remain ineligible for compaction in the log.

    • set cleanup.policy=delete,compact if your application can afford to lose older messages. With both cleanup policies active, you keep at least the latest state for each key within a given retention time (or even byte size).
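    The settings above are per-topic configurations, so you can apply them with the kafka-configs tool that ships with Kafka. A sketch, assuming a topic named roadworks and a broker at localhost:9092 (both are placeholders for your setup; the lag value of one hour is just an example):

    ```shell
    # Apply compaction tuning to the (hypothetical) "roadworks" topic.
    kafka-configs.sh --bootstrap-server localhost:9092 \
      --entity-type topics --entity-name roadworks \
      --alter --add-config 'min.cleanable.dirty.ratio=0.1,max.compaction.lag.ms=3600000'
    ```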

    Also, if you are concerned with volume, set a compression.type in your producer. Since Kafka 2.1.0, zstd is available, which usually reduces the byte size significantly.
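    Putting it together on the producer side, the relevant settings are plain producer config entries. A sketch of the properties such an ingestion job might use (the broker address is an assumption; the config keys are the standard Kafka producer names):

    ```java
    import java.util.Properties;

    // Sketch: producer configuration for the roadworks ingestion job.
    public class ProducerSettings {
        public static Properties build() {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
            // Roadwork id goes in the key so compaction can collapse duplicates.
            props.setProperty("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.setProperty("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // zstd (Kafka >= 2.1.0) usually compresses text payloads well.
            props.setProperty("compression.type", "zstd");
            return props;
        }
    }
    ```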