I'm currently assessing Apache Kafka for use in our technology stack. One requirement that may become critical is a contractual or legal obligation to audit the system's behaviour and to retain that audit information for up to a year.
Given the volume of data we process, we will most likely need to cold-store it rather than simply partitioning the data and setting a long retention period. Cold storage here means Amazon S3 or multiple locally held, terabyte-scale HDDs.
We could, of course, set up a logger against every topic.
But this feels like it should be a solved problem, and I just can't find a documented solution.
What's the best way of archiving old data from Apache Kafka rather than simply discarding it?
You could use the Kafka Connect S3 sink connector to stream the data to S3, and then set the retention period on your topics as required so that Kafka ages out the data once it has been archived.
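As a rough illustration, here is a minimal Python sketch that builds a configuration for the Confluent S3 sink connector and submits it through the Kafka Connect REST API. It assumes the connector plugin is already installed on your Connect workers. The connector name, topic names, bucket, region, Connect URL, and `flush.size` value are all placeholders I've made up, and the helper functions are hypothetical, so adjust them to your setup.

```python
import json
import urllib.request


def retention_ms(days: int) -> int:
    """Convert a retention period in days to a value for the retention.ms topic setting."""
    return days * 24 * 60 * 60 * 1000


def build_s3_sink_config(topics, bucket, region):
    """Build a config dict for the Confluent S3 sink connector.

    The bucket, region and flush.size values are illustrative assumptions.
    """
    return {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": ",".join(topics),
        "s3.bucket.name": bucket,
        "s3.region": region,
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        # Number of records written to each S3 object before it is committed.
        "flush.size": "1000",
    }


def create_connector(connect_url, name, config):
    """POST the connector definition to a Kafka Connect worker (e.g. http://localhost:8083)."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

You would call something like `create_connector("http://localhost:8083", "audit-s3-sink", build_s3_sink_config(["orders", "payments"], "my-audit-archive", "eu-west-1"))`, and then set `retention.ms` on each topic (for example to `retention_ms(7)`) so Kafka only keeps what has not yet needed archiving. Leave enough retention headroom for the connector to catch up after an outage.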