Tags: amazon-s3, apache-kafka, data-pipeline

Is there a configuration in Kafka to write multiple records to one S3 object?


I'm using an S3 Sink Connector to write records from Kafka to S3. Eventually I will be using Kafka to capture CDC packets from my database and then write those packets to S3.

However, I don't want every single CDC packet, which will be a single record in Kafka, written to a separate S3 object. I would like to configure a size- or time-based condition so that all records accumulated every X seconds or Y bytes are written to one S3 object.

I haven't been able to find anything that writes multiple records to one object. I have found the Kafka consumer properties fetch.min.bytes and fetch.max.wait.ms, which do make objects get written every X seconds or Y bytes, but multiple records are still written as separate objects.
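For reference, here is a minimal sketch of the consumer-level settings mentioned above, assuming they are applied through the Connect worker's `consumer.`-prefixed configuration. The values are illustrative; these properties only control how the connector's underlying consumer fetches batches from Kafka, not how records are grouped into S3 objects.

```
# Connect worker configuration (e.g. connect-distributed.properties) -- sketch only.
# These tune consumer fetch batching; they do not merge records into one S3 object.

# Wait until at least ~1 MB of data is available per fetch...
consumer.fetch.min.bytes=1048576
# ...or until 10 seconds have passed, whichever comes first.
consumer.fetch.max.wait.ms=10000
```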


Solution

  • You shouldn't use a basic consumer for this (you could, but then you'd have to write all of the "batching" logic you're asking for yourself).

    The S3 Kafka Connect sink already does this via flush.size (a record count, not bytes) and/or a time-based partitioner; see the example config after this list.

    The Secor project is also something to look at.
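A minimal sketch of an S3 sink connector configuration along those lines (the connector name, topic, bucket, and region are placeholders). flush.size, rotate.schedule.interval.ms, and the TimeBasedPartitioner settings are the knobs that batch many records into a single object:

```
# Sketch of an S3 sink connector config (standalone .properties form);
# topic, bucket, and region values are placeholders.
name=s3-sink-cdc
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=cdc-topic
s3.bucket.name=my-cdc-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat

# Write one S3 object per 10,000 records (flush.size counts records, not bytes)...
flush.size=10000
# ...or at least every 10 minutes of wall-clock time, whichever comes first.
rotate.schedule.interval.ms=600000

# Time-based partitioner: group objects under hourly "directory" paths.
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
```

With settings like these, an object is closed either when 10,000 records have accumulated or when the rotation interval elapses, so each CDC packet no longer ends up as its own S3 object.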