Tags: apache-kafka, kafka-producer-api

Replicate a file over Kafka and prevent duplicate data


I'm interested in publishing a file's contents to a Kafka topic in real time (I can do this in Python), but I'm wondering what strategy would be effective to prevent sending duplicate data if my publisher crashes and I need to restart it. Is there anything in Kafka that helps with this directly, or must I explicitly track the file offset I've published so far?

I suppose another approach would be for the publisher to bootstrap by consuming the data it has already published, count the bytes received, and then seek to that offset in the file and resume from there?
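
To make the idea concrete, here is a rough sketch of the checkpoint approach I have in mind, using the kafka-python client (the paths, topic name, and chunk size are all made up):

    import os
    from kafka import KafkaProducer  # pip install kafka-python

    SOURCE_PATH = "/var/log/source.dat"        # made-up file to replicate
    CHECKPOINT_PATH = SOURCE_PATH + ".offset"  # made-up checkpoint location
    TOPIC = "file-replication"                 # made-up topic name

    def load_offset():
        # Resume from the last durably recorded byte offset, or start at 0.
        try:
            with open(CHECKPOINT_PATH) as f:
                return int(f.read())
        except (FileNotFoundError, ValueError):
            return 0

    def save_offset(offset):
        # Write-then-rename so a crash cannot leave a half-written checkpoint.
        tmp = CHECKPOINT_PATH + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(offset))
        os.replace(tmp, CHECKPOINT_PATH)

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    offset = load_offset()

    with open(SOURCE_PATH, "rb") as f:
        f.seek(offset)
        while chunk := f.read(4096):
            # Block until the broker acks before advancing the checkpoint,
            # so a crash re-sends at most the last chunk.
            producer.send(TOPIC, chunk).get(timeout=30)
            offset += len(chunk)
            save_offset(offset)

    producer.flush()

This still only narrows the window to the last un-acknowledged chunk (at-least-once rather than exactly-once), which is why I'm asking whether Kafka offers something better.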

Are there any existing scripts or apps that already handle this that I could leverage instead?


Solution

  • Instead of publishing it yourself, I strongly recommend using Kafka Connect. In addition to sparing you from writing custom code, some connectors can also provide "exactly-once" delivery for you; a minimal example is sketched below.

    More details about connectors can be found here: https://www.confluent.io/product/connectors/
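
    For example, the FileStreamSource connector that ships with Kafka can tail a file when run in standalone mode. A minimal properties file might look like this (the file path and topic name are placeholders):

        # file-source.properties -- standalone Kafka Connect source config
        name=local-file-source
        connector.class=FileStreamSource
        tasks.max=1
        file=/path/to/source.dat
        topic=file-replication

    Started with bin/connect-standalone.sh config/connect-standalone.properties file-source.properties, Connect persists how far into the file it has read in its own offset storage, so a restarted worker resumes from the last committed position instead of re-reading the whole file. Note that FileStreamSource is intended as a simple example connector; for production use, pick a hardened connector from the catalog linked above.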