
How to archive, not discard, old data in Apache Kafka?


I'm currently assessing Apache Kafka for use in our technology stack. One thing that may become critical is a contractual or legal requirement to audit the system's behaviour, retaining the audit information for up to a year.

Given the volume of data we process, we will most likely need to cold-store this rather than simply partitioning the data and setting a long retention period. "Cold-store" here means storing in Amazon S3 or on multiple locally held multi-terabyte HDDs.

We could, of course, set up a consumer against every topic that logs the data out to cold storage ourselves.

But this feels like it should be a solved problem, and I just can't find a documented solution.

What's the best way of archiving old data from Apache Kafka rather than simply discarding it?


Solution

  • You could use the Kafka Connect S3 sink connector to stream the data to S3, and then set the retention period on your topics as required so the brokers age out data that has already been archived.
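
    As a sketch, a minimal configuration for Confluent's S3 sink connector might look like the following (the connector name, topic, bucket, region, and flush size are placeholder values, and this assumes the connector plugin is installed on your Connect workers):

    ```json
    {
      "name": "s3-archive-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "audit-events",
        "s3.bucket.name": "my-kafka-archive",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000"
      }
    }
    ```

    You would submit this JSON to the Kafka Connect REST API (`POST /connectors`), then set `retention.ms` on the source topics (for example via `kafka-configs.sh --alter --add-config retention.ms=...`) so Kafka discards data only after the connector has had ample time to copy it to S3.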