apache-kafka apache-kafka-streams

Why do file handles and CPU usage follow a sawtooth curve on our Kafka cluster?


We are seeing an interesting file handle and CPU usage pattern on our Kafka cluster that I cannot explain :) I'm not sure which information is needed to figure out the reason, so I will list some (tell me if anything is missing):

  • 3 nodes in the cluster
  • all topics are replicated three times
  • the ratio of delete to compacted topics is about 50/50, with a retention time of 7 days
  • delete topics (~16) mostly with 1 partition
  • compacted topics (~20) mostly with 16 partitions
  • all topics use the default settings

In addition, we have 4 compacted topics (1 partition each) with a very small segment.ms and retention.ms (set to 1 minute). These topics are used as a cache to serve the latest values.
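For a rough sense of scale, the figures listed above can be turned into a per-broker count of partition replicas. This is only an estimate: it treats the approximate numbers ("~16", "~20", "mostly") as exact, and the names used in the code are made up for illustration. With 3 brokers and a replication factor of 3, every broker holds a replica of every partition.

```python
# Approximate figures from the question; treated as exact for the estimate.
REPLICATION_FACTOR = 3
BROKERS = 3

topic_groups = {
    "delete": {"topics": 16, "partitions": 1},
    "compacted": {"topics": 20, "partitions": 16},
    "compacted cache": {"topics": 4, "partitions": 1},
}


def replicas_per_broker(topics, partitions):
    # With replication factor == broker count, each broker hosts every partition.
    return topics * partitions * REPLICATION_FACTOR // BROKERS


per_broker = {name: replicas_per_broker(**g) for name, g in topic_groups.items()}
total = sum(per_broker.values())

for name, count in per_broker.items():
    print(f"{name}: {count} partition replicas per broker ({count / total:.0%})")
```

Under these assumptions the compacted topics account for 320 of about 340 partition replicas per broker, so whatever happens to their segments dominates the number of files the broker keeps open.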

Here is a metric showing the sawtooth behavior:

[Image: graph of file handles and CPU usage over time]

The file handle spikes are each about 7 days long and also seem to correlate with the CPU usage. The default segment.ms (which we use for the majority of our topics) is also 7 days. I'm not sure whether this is related.
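One way to check whether the number of segments follows the same 7-day rhythm is to count segment files per topic in the broker's data directory. The following is a minimal sketch, assuming the usual on-disk layout in which each partition lives in a directory named `<topic>-<partition>` and each segment has one `.log` file. The function name `count_segments` is hypothetical, and the path passed in must be the broker's actual `log.dirs` value.

```python
from collections import Counter
from pathlib import Path


def count_segments(log_dir):
    """Return a Counter mapping topic name -> number of .log segment files."""
    counts = Counter()
    for partition_dir in Path(log_dir).iterdir():
        if not partition_dir.is_dir():
            continue
        # Partition directories are named "<topic>-<partition>". Topic names
        # may contain hyphens themselves, so split on the last hyphen only.
        topic, sep, partition = partition_dir.name.rpartition("-")
        if not sep or not partition.isdigit():
            continue  # not a partition directory
        counts[topic] += sum(1 for _ in partition_dir.glob("*.log"))
    return counts
```

Calling `count_segments(path).most_common(10)` at intervals and comparing the results over a week would show which topics accumulate segments between the drops in the graph.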

Any ideas why this happens? Thanks!


Solution

  • Apparently, this behavior is caused by our "compacted" topics. We replaced almost all "compacted" topics with "delete" topics and kept only the 4 that are really mandatory (for caching). Now the behavior is back to normal (as you can see for the last couple of days in the graph).

    [Image: the file handle and CPU usage graph after the change]

    In Kafka, each topic partition is stored as a series of segments. A segment is only "garbage collected" once its last entry is gone. If a topic is compacted, single entries (keys that never receive further updates) might block a whole segment from being "garbage collected", which leads to many open file handles. With "delete", segments are garbage collected more regularly.
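The explanation above can be illustrated with a toy model. It is not Kafka's log cleaner: the real cleaner also rewrites segments, which this sketch ignores. All names here (`build_log`, `surviving_segments`, the `hot` and `once-N` keys) are made up for illustration. The model assumes one segment is rolled per day and that a compacted segment survives as long as it holds the latest record for at least one key.

```python
def build_log(days, with_one_off_keys):
    """One segment per day; each record is (offset, key, day)."""
    segments, offset = [], 0
    for day in range(days):
        keys = ["hot"] + ([f"once-{day}"] if with_one_off_keys else [])
        segment = []
        for key in keys:
            segment.append((offset, key, day))
            offset += 1
        segments.append(segment)
    return segments


def surviving_segments(segments, policy, now, retention_days=7):
    if policy == "delete":
        # Keep a segment only while its newest record is within retention.
        return [s for s in segments
                if now - max(day for _, _, day in s) < retention_days]
    # "compact": find the latest offset per key (records are in append order).
    latest = {}
    for segment in segments:
        for offset, key, _ in segment:
            latest[key] = offset
    # A segment survives if it still holds the latest record of any key.
    return [s for s in segments
            if any(latest[key] == offset for offset, key, _ in s)]


log = build_log(days=30, with_one_off_keys=True)
print("delete :", len(surviving_segments(log, "delete", now=29)))
print("compact:", len(surviving_segments(log, "compact", now=29)))
```

In this model the "delete" policy levels off at 7 segments, while the compacted log keeps all 30 because every segment contains one key that is never written again. If only the frequently updated key is present (`with_one_off_keys=False`), the compacted log shrinks to a single segment.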