Why can't the partition leader election logic in a Kafka cluster sit in Zookeeper, rather than having the controller broker do it?

Quoting this answer -

kafka uses zookeeper for a few things:

cluster membership - the live brokers of a cluster are those who have ephemeral ZK nodes

leader election - election of the kafka broker that acts as a controller

state storage - some (mostly the older) state is stored in ZK - the configuration for topics, for example. some state that used to be in ZK has been migrated to special topics (consumer offsets) and some newer functionality was written to store state entirely in kafka (transaction logs, for example). the general trend is to stop using state in ZK and instead self-host it (although older parts of the code have never been migrated out).

as for why not use ZK for partition leader election - one reason is there is logic involved. when electing a cluster leader broker there's no preference - any broker will do. this fits well with how ZK-based leader-election works (1st member to create and own an ephemeral znode wins).

when choosing a partition leader, however, you need a little bit more logic. for example - you'd like to elect the leader with the "highest watermark" (with the most up to date data, remember replication is generally async). there's also logic around unclean leader election. ZK alone cannot do that, hence it is done by the controller.

Can anyone explain the last 2 paragraphs more?

I tried searching the net, but found nothing concrete.

Solution

Kafka doesn't require Zookeeper anymore: KRaft mode has been production-ready since Kafka 3.3, and Kafka 4.0 removes the Zookeeper dependency entirely.

After Kafka 0.9, several functions that used to rely on Zookeeper (storing consumer offsets, for example) were moved to the brokers.

But otherwise, Zookeeper doesn't track topic offsets (i.e. watermarks); its election recipe only settles who created an ephemeral znode first. That's all the last paragraph is saying.
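To make the difference concrete, here is a rough sketch of the two kinds of election. It is not Kafka's actual code: FakeZooKeeper, PartitionState, elect_controller and elect_partition_leader are hypothetical names invented for this illustration, and the in-sync replica set (ISR) stands in for "the replicas with the most up-to-date data". The last argument mimics what the unclean.leader.election.enable setting controls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class FakeZooKeeper:
    """Hypothetical in-memory stand-in: it only knows which znodes exist."""

    def __init__(self) -> None:
        self.znodes: dict[str, int] = {}

    def create_ephemeral(self, path: str, owner: int) -> bool:
        # Creation succeeds only if nobody has created the path yet.
        if path in self.znodes:
            return False
        self.znodes[path] = owner
        return True


def elect_controller(zk: FakeZooKeeper, broker_ids: list[int]) -> int:
    """Controller election: any broker will do, first creator wins."""
    for broker_id in broker_ids:
        if zk.create_ephemeral("/controller", broker_id):
            return broker_id
    # Someone else already owns the znode; that broker is the controller.
    return zk.znodes["/controller"]


@dataclass
class PartitionState:
    replicas: list[int]  # assigned replicas, in preference order
    isr: set[int]        # replicas currently in sync with the old leader


def elect_partition_leader(
    state: PartitionState,
    live_brokers: set[int],
    unclean_allowed: bool,
) -> Optional[int]:
    """Partition leader election: needs replication state, not just liveness."""
    # Clean election: a live replica that is also in sync loses no committed data.
    for replica in state.replicas:
        if replica in live_brokers and replica in state.isr:
            return replica
    # Unclean election: accept possible data loss to regain availability.
    if unclean_allowed:
        for replica in state.replicas:
            if replica in live_brokers:
                return replica
    # No eligible replica: the partition stays offline.
    return None
```

The first function needs nothing beyond "does this znode exist", which is exactly what Zookeeper offers. The second has to combine liveness with replication state and a policy decision, and that is the part the controller handles.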