Search code examples
apache-kafkaapache-kafka-streamsdata-partitioning

Kafka Streams - How to efficiently join with a large, non-copartitioned store/topic


We have a stream of web events.

The event is partitioned by (domain, uid).

All events explained here are from same domain. There are thousands of domains, very uneven in traffic (hence that partitioning).

Let's say we have events from one unregistered user (uid1). We have events that come from the same unregistered user from a separate device, which creates a new uid (let's call it uid2).

When we have a registration on uid1, it registers with an email (email1). Later, from second device, it logs in - so we can know both uids come from the same user.

When this happens, we could check a state store for the user identifier (e.g. email) on login to see if it exists and hence get the correct user.

However, since they are different uids, they will not be copartitioned. Partitioning just by domain instead of (domain, uid) is not desirable.

Separately, size of such a user store may be very big to be kept in each of the application instances (millions of records), so it may be too much for a GlobalKTable store.

How to work this out?


Solution

  • What occurs to me is that if we have uid1 which corresponds to uid2, then we could store the user data of uid1 in a local KTable on the uid2 instance. Because uid2 always goes to that instance then we only need that the uid1 be stored in the KTable on that instance (not in the global KTable).

    So you could have a global store, outside of Kafka, perhaps in a distributed in-memory key/value store. On receiving uid2 and not knowing the user but having the email address, you check the KTable, if it is not there then you look for it in the global store outside Kafka, then store it in the KTable for future local access. From that point on, you'll always have the user data of uid2 local to its instance.

    This way you only pay the cost of a network call to the key/value store the first time you see a new login from an unknown uid.