apache-kafka, kafka-topic

Fixing a Kafka cluster with under-replicated partitions


We are having a problem with one of our Kafka clusters. We have 6 nodes running v1.0; all topics have a replication factor of 3 and 10 partitions per topic, which seemed to be enough for us.

Due to a power failure, 3 of the nodes went down for a while, and now we have A LOT of topics which are reported as having under-replicated partitions.
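For context, a partition counts as under-replicated when its in-sync replica set (ISR) is smaller than its assigned replica list. Below is a minimal sketch for measuring the damage from the text output of `kafka-topics.sh --describe`. It assumes the per-partition line format printed by Kafka versions of that era (`Topic: … Partition: … Leader: … Replicas: … Isr: …`); the function names and the sample data are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Assumed per-partition line format of `kafka-topics.sh --describe`:
#   Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,4  Isr: 2
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*\S+\s+Replicas:\s*(?P<replicas>[\d,]+)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(csv):
    """Turn '1,2,3' into [1, 2, 3]; an empty string gives []."""
    return [int(x) for x in csv.split(",") if x]


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR does not contain all of its assigned replicas."""
    result = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # topic summary lines and blank lines do not match
        missing = sorted(set(_ids(m["replicas"])) - set(_ids(m["isr"])))
        if missing:
            result.append((m["topic"], int(m["partition"]), missing))
    return result


def missing_per_broker(describe_output):
    """Count, per broker id, how many partitions it is missing from the ISR of."""
    counts = Counter()
    for _topic, _partition, missing in under_replicated(describe_output):
        counts.update(missing)
    return counts


# Hypothetical sample output for one topic with two partitions.
SAMPLE = "\n".join([
    "Topic:orders\tPartitionCount:2\tReplicationFactor:3\tConfigs:",
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3",
    "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,4\tIsr: 2",
])
```

Grouping the missing replicas by broker id shows whether the problem is spread over the whole cluster or concentrated on a few nodes, which helps decide which brokers to look at first.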

The only solution we have seen on forums (and it seems to be the most accepted one) is to do a rolling restart until everything gets magically fixed, but I hope there is a better solution. Has anybody recovered from this situation? Network and CPU shouldn't be what is keeping the replicas from getting in sync, as neither is anywhere near its limits.

Thanks a lot!


Solution

  • In the end we recovered the cluster by deleting many of the broken topics by hand, which reduced the under-replicated partitions from about 4600 to around 1k.

    After that, since all the remaining ones were on only 2 of the nodes, we decided to do an ordered shutdown of both nodes, and after that replication started again.

    I suppose there is some kind of bug that makes Kafka stop replicating from some nodes, but this did the trick.

    Update:

    Once the cluster is stable, you can also try to rebalance the broken partitions between the available brokers. In my experience it is better to generate small rebalance files instead of rebalancing the full cluster at once, as the process usually gets stuck midway (at least in old versions).
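The "small rebalance files" idea can be sketched roughly as follows. This assumes the plan is in the JSON format accepted by `kafka-reassign-partitions.sh` (a `version` field plus a `partitions` list of `topic` / `partition` / `replicas` entries); the function names, file prefix, and batch size are hypothetical choices, not something from the answer above.

```python
import json
from pathlib import Path


def split_reassignment(plan, batch_size):
    """Split one reassignment plan into several smaller plans, each holding
    at most `batch_size` partitions, preserving the original order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    partitions = plan["partitions"]
    return [
        {"version": plan.get("version", 1),
         "partitions": partitions[i:i + batch_size]}
        for i in range(0, len(partitions), batch_size)
    ]


def write_batches(batches, directory, prefix="reassign-batch"):
    """Write each batch to its own JSON file and return the file paths.
    The file naming scheme is an arbitrary choice for this sketch."""
    paths = []
    for n, batch in enumerate(batches):
        path = Path(directory) / f"{prefix}-{n:03d}.json"
        path.write_text(json.dumps(batch, indent=2))
        paths.append(path)
    return paths
```

Each resulting file could then be passed in turn to `kafka-reassign-partitions.sh` with `--reassignment-json-file <file> --execute`, checking with `--verify` that the batch has completed before starting the next one. If one batch stalls, only a handful of partitions are affected instead of the whole cluster.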