Search code examples
apache-kafkaapache-kafka-streams

Kafka Streams: Increasing topic partitions for an application performing a KTable-KTable foreign key join


Most of the information I find relates to primary key joins. I understand foreign key joins are a relatively new feature for Kafka Streams. I'm interested in how this will scale. I understand that Kafka Streams parallelism is capped by the number of partitions on each topic, however I have a few questions around what it means to increase the input topic partitions.

  • Does the foreign key join have the same requirement to co-partition input topics? That is, do both topics need to have the same number of partitions?
  • How does one add a partitions later after the application has been running in production for months or years? The changelog topics backing each KTable store data from certain input topic partitions. If one is to increase the partitions in the input topics, how does this impact our KTables' state stores and changelogs? Presumably, we cannot just start over and lose that data since it has accumulated over months and years and is essential to performing the join. It may not be quickly replaced by upstream data. Do we need to blow away our state stores, create new input topics, and re-send all KTable changelog topic data to them?
  • How about the other internal "subscription" topics?

Solution

  • Does the foreign key join have the same requirement to co-partition input topics? That is, do both topics need to have the same number of partitions?

    No. For more details check out https://www.confluent.io/blog/data-enrichment-with-kafka-streams-foreign-key-joins/

    How does one add a partitions later after the application has been running in production for months or years?

    You cannot really do this, even if you don't use Kafka Streams. The issue is, that your input data is partitioned by key, and if you add a partition the partitioning in your input topic breaks. -- The recommended pattern is to create a new topic with different number of partitions.

    The changelog topics backing each KTable store data from certain input topic partitions. If one is to increase the partitions in the input topics, how does this impact our KTables' state stores and changelogs?

    It would break the application. In fact, Kafka Streams will check and will raise an exception if it detect that the number of input topic partitions does not match the number of changelog topic partitions.