Tags: apache-kafka, aws-msk

MSK Not Deleting Old Messages


I have three MSK clusters: dev, nonprod and prod. They all share the following cluster configuration; there is no topic-level configuration.

auto.create.topics.enable=false
default.replication.factor=3
min.insync.replicas=2
num.io.threads=8
num.network.threads=5
num.partitions=1
num.replica.fetchers=2
log.retention.hours=100
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000

Dev and nonprod are clearing down messages older than 100 hours, as defined by the log.retention.hours=100 setting.

We have a lot more traffic going through our production cluster, and old messages are not being removed: hundreds of thousands of messages older than 400 hours are still on the cluster. I have thought about adding further config settings such as

segment.bytes
segment.ms

to roll segments more quickly, on the theory that a segment which hasn't rolled yet can't be marked for deletion. However, this same configuration is working nicely in the other clusters, albeit ones not receiving as much traffic.
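If segment rolling were the bottleneck, these settings could be applied as a topic-level override rather than cluster-wide. A hedged sketch using the standard kafka-configs.sh tool; the broker address, topic name, and value are placeholders:

```shell
# Roll a new segment at least every 6 hours on one topic (example value)
kafka-configs.sh --bootstrap-server <broker>:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config segment.ms=21600000
```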


Solution

  • So this turned out to be an issue with a producer sending messages to Kafka with timestamps in US date format (MM/DD) rather than UK (DD/MM). It therefore created messages that appeared to be timestamped in the future, and hence were never older than 100 hours and so never eligible for deletion.
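To illustrate the root cause: the same date string parsed with the two conventions can land months apart, and one parse can sit well in the future. A minimal sketch (the date string is an example, not taken from the actual producer):

```python
from datetime import datetime

raw = "12/05/2023"                        # producer meant 12 May 2023 (UK day/month)
uk = datetime.strptime(raw, "%d/%m/%Y")   # parsed as 12 May 2023
us = datetime.strptime(raw, "%m/%d/%Y")   # mis-parsed as 5 December 2023

# The US parse lands nearly seven months in the future,
# so the record never ages past the 100-hour retention window.
print(uk.date(), us.date())
```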

    To remove the existing messages we set log.retention.bytes, which prunes messages irrespective of the log.retention.hours setting. This caused the Kafka topic to be pruned and the erroneous messages to be deleted; we then unset log.retention.bytes.
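For reference, the topic-level equivalent of this setting is retention.bytes (log.retention.bytes is the broker-level name), and it can be applied and later removed with kafka-configs.sh. A sketch with placeholder broker address, topic name, and size:

```shell
# Temporarily cap the topic at ~1 GiB per partition to force pruning
kafka-configs.sh --bootstrap-server <broker>:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.bytes=1073741824

# Once the erroneous messages are gone, remove the override
kafka-configs.sh --bootstrap-server <broker>:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --delete-config retention.bytes
```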

    Next we set log.message.timestamp.type=LogAppendTime to ensure that messages are stamped with the broker's append time as opposed to the producer-supplied document time. This will prevent bad dates from producers causing this issue again in the future.
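The underlying mechanism: Kafka deletes a closed segment only once the largest record timestamp in that segment falls outside the retention window, so a single future-dated record keeps its whole segment alive. A minimal sketch of that eligibility rule (the dates and retention window mirror the question's setup; the function name is illustrative, not a Kafka API):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=100)  # mirrors log.retention.hours=100

def segment_deletable(largest_record_ts: datetime, now: datetime) -> bool:
    """A segment is deletable only when its *largest* record timestamp
    has aged past the retention window."""
    return now - largest_record_ts > RETENTION

now = datetime(2023, 6, 15, tzinfo=timezone.utc)
old_segment = now - timedelta(hours=500)     # genuinely old data: deletable
future_segment = now + timedelta(days=170)   # mis-parsed future timestamp: never deletable

print(segment_deletable(old_segment, now), segment_deletable(future_segment, now))
```

With LogAppendTime the broker stamps each record on arrival, so a producer can no longer inject a future timestamp that pins segments in place.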