Tags: amazon-s3, apache-kafka, apache-kafka-connect, s3-kafka-connector

Partitioning with key in Kafka Connect S3 sink


Can we partition the output of the S3 sink connector by the record key? How can we configure the connector to keep only the latest 10 records for each key, or only the last 10 minutes of data? Or can we partition by both key and time period?


Solution

You'd need to set store.kafka.keys=true for the S3 sink to store record keys at all (it does not store them by default), but the keys will be written to their own files, separate from the values, within whatever partitioner you've configured.
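A minimal sketch of the relevant connector config, assuming the Confluent S3 sink with JSON output (the topic, bucket, and region values here are placeholders):

    {
      "name": "s3-sink-with-keys",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "100",
        "store.kafka.keys": "true",
        "keys.format.class": "io.confluent.connect.s3.format.json.JsonFormat"
      }
    }

With this, each flushed value file gets a companion key file written alongside it in the same partition path.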

Otherwise, the FieldPartitioner only looks at the record's value, so you'd need an SMT to copy the record key into the value in order to partition on it; see the sketch below.
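Apache Kafka's built-in transforms don't include a key-to-value copy, so this would be a small custom SMT. A minimal sketch, assuming schemaful (Struct) values, a primitive key, and a hypothetical com.example package:

    package com.example.transforms;

    import java.util.Map;

    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.ConnectRecord;
    import org.apache.kafka.connect.data.Field;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaBuilder;
    import org.apache.kafka.connect.data.Struct;
    import org.apache.kafka.connect.transforms.Transformation;

    // Hypothetical SMT that copies the record key into a "key" field of
    // the value struct so that FieldPartitioner can partition on it.
    public class KeyToValue<R extends ConnectRecord<R>> implements Transformation<R> {

        @Override
        public R apply(R record) {
            Schema valueSchema = record.valueSchema();
            Struct value = (Struct) record.value();

            // Rebuild the value schema with one extra "key" field.
            SchemaBuilder builder = SchemaBuilder.struct().name(valueSchema.name());
            for (Field f : valueSchema.fields()) {
                builder.field(f.name(), f.schema());
            }
            Schema newSchema = builder.field("key", record.keySchema()).build();

            // Copy the existing fields, then append the key.
            Struct newValue = new Struct(newSchema);
            for (Field f : valueSchema.fields()) {
                newValue.put(f.name(), value.get(f));
            }
            newValue.put("key", record.key());

            return record.newRecord(record.topic(), record.kafkaPartition(),
                    record.keySchema(), record.key(), newSchema, newValue,
                    record.timestamp());
        }

        @Override
        public ConfigDef config() {
            return new ConfigDef();
        }

        @Override
        public void configure(Map<String, ?> configs) {
        }

        @Override
        public void close() {
        }
    }

Once it's packaged onto the plugin path, you'd enable it with transforms=keyToValue and transforms.keyToValue.type=com.example.transforms.KeyToValue, then set partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner and partition.field.name=key.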

Last I checked, there is still an open PR on GitHub for a combined Field and Time partitioner.


The S3 sink doesn't window or compact any data; it dumps and stores everything. You'll need an external process, such as a Lambda function, to clean up data over time.
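For purely time-based retention, a bucket lifecycle rule is the simplest external cleanup. A minimal sketch, assuming the connector's default topics/<topic>/ prefix; note that lifecycle expiry is day-granular, so a 10-minute window, or keeping only the latest 10 records per key, would still need custom code such as a scheduled Lambda:

    {
      "Rules": [
        {
          "ID": "expire-s3-sink-output",
          "Filter": { "Prefix": "topics/my-topic/" },
          "Status": "Enabled",
          "Expiration": { "Days": 1 }
        }
      ]
    }

You'd apply it with aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json.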