apache-kafka, broker

Constant ISR shrinking and expanding


We have an Apache Kafka cluster with 5 brokers and 3 ZooKeeper nodes. ZooKeeper is version 3.14.3 and the brokers are 2.0.0. I've been trying for a long time now to understand why the brokers get disconnected from the cluster - I'm getting dozens of "Shrinking ISR from x,y to x" messages and, a few seconds later, dozens of "Expanding ISR from x to x,y" messages for each partition of every topic. For example:

Nov 17 10:06:06 HOSTNAME kafka-server-start.sh[17252]: [2019-11-17 10:06:06,188] INFO [Partition topicname-14 broker=1] Expanding ISR from 1 to 1,3 (kafka.cluster.Partition)

The "expand" logs arrive ~7 seconds after the "shrink" logs, and this repeats itself every 1-5 minutes.

06:54:27 - Shrinking
06:54:32 - Expanding
06:55:47 - Shrinking
06:55:52 - Expanding
06:57:07 - Shrinking
06:57:13 - Expanding
07:01:27 - Shrinking
07:01:36 - Expanding
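
While this is happening, the affected partitions also show up as under-replicated (the ISR is smaller than the replica set). A quick way to see them is the topic tool that ships with the brokers - this is only a sketch, the install path is a placeholder and I'm pointing it at one of the ZooKeeper hosts from the config below:

# List partitions whose ISR is currently smaller than the full replica set
/path/to/kafka/bin/kafka-topics.sh --describe --under-replicated-partitions --zookeeper A:2181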

I didn't find anything that seems out of the ordinary on the ZooKeeper side, and nothing sticks out in the other log files (controller.log, state-change.log, kafka-authorizer), while these messages show up in server.log.

The load is pretty balanced between the brokers. We recently added 2 more brokers, but the problem predates the addition. No broker seems overly strained, and they're all aligned configuration-wise.

This is the broker's server.properties:

ssl.key.password=XXXX
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
ssl.keystore.password=XXXX
advertised.listeners=SASL_SSL://HOSTNAME.FQDN:9092
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
ssl.keystore.filename=kafka.keystore.jks
zookeeper.connect=A:2181, B:2181, C:2181
security.inter.broker.protocol=SASL_SSL
super.users=User:admin
ssl.truststore.credentials=keystore_creds
jmx.port=9999
ssl.keystore.credentials=keystore_creds
log.roll.hours=24
ssl.truststore.location=/etc/kafka/secrets/kafka.truststore.jks
delete.topic.enable=TRUE
message.max.bytes=2097152
ssl.truststore.password=XXXX
broker.id=1
ssl.key.credentials=keystore_creds
log.dirs=/var/lib/kafka/data
ssl.truststore.filename=kafka.truststore.jks
listeners=SASL_SSL://IPADDRESS:9092
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=PLAIN
log.retention.ms=86400000
log.retention.bytes=536870912000
auto.create.topics.enable=false
zookeeper.session.timeout.ms=10000
num.partitions=18
default.replication.factor=2

Data does flow successfully in and out of the cluster. My problem is that the producers get disconnected from the brokers each time this happens, and the constant shrinking and expanding of the ISRs must cost the system a lot; it causes the producers' queues to grow until their local queue fills up. The producers are configured to connect to a VIP, not to an array of servers or to specific brokers.
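
For context, the local queue I'm referring to is the producer's client-side buffer, whose behavior is governed by a few client settings along these lines (illustrative values only, not our exact producer config):

# Memory the producer may use to buffer records waiting to be sent (default 32 MB)
buffer.memory=33554432
# How long send() may block when that buffer is full before throwing
max.block.ms=60000
# How long to wait for a broker response before retrying a request
request.timeout.ms=30000
retries=3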

Let me know if there's any other info I can provide to help track down the cause of the issue. Thanks!


Solution

  • After reading https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/, I increased replica.lag.time.max.ms to 20,000 (from the default 10,000), and the ISR expansion and shrinkage has stopped.
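
  • For anyone hitting the same issue, the change is a single line in each broker's server.properties (as far as I know this is a static broker setting, so it needs a rolling restart of the brokers to take effect):

# Followers that haven't caught up with the leader within this window are dropped from the ISR
replica.lag.time.max.ms=20000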