Search code examples
apache-kafkaapache-kafka-streams

Kafka Streams: Stream Thread vs Partition of multiple topics


Suppose I have 2 topics say xyz1, xyz2, each having 3 partitions. If I have a single Kafka stream application having 3 threads, can the following scenario occur?

Thread                  Partition
    1       xyz1-partition 0, xyz2-partition 2
    2       xyz1-partition 1, xyz2-partition 0
    3       xyz1-partition 2, xyz2-partition 1

as opposed to:

Thread                  Partition
    1       xyz1-partition 0, xyz2-partition 0
    2       xyz1-partition 1, xyz2-partition 1
    3       xyz1-partition 2, xyz2-partition 2

Essentially, a single thread consuming data from a particular partition of 2 different topics and the partition number can be varying? Assuming we use low-level processor API


Solution

  • Its depends

    Plain Kafka Consumer:

    Kafka Consumer Group consists pool of consumers/instances/processes with the same group.id can either be running on the same machine or distributed machines. Kafka Consumer uses rebalancing to assign partitions on each consumer without overlapping mean one partition can assign at most one consumer process of Consumer Group.

    It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled

    So in case of partition can be assigned to any thread while rebalancing.

    enter image description here

    Kafka Stream:

    Kafka uses stream tasks as a logical unit to assign partition and parallelize process. Kafka Stream creates a number of stream task based on stream partitions and assigns a list of partitions to each task. Once the task assigned to partitions it will stick and manage parallelism on their own topology. As a result stream tasks can be processed independently and in parallel without manual intervention.

    Default implementation of the PartitionGrouper interface that groups partitions by the partition id. Join operations requires that topics of the joining entities are partitioned, i.e., being partitioned by the same key and having the same number of partitions. Copartitioning is ensured by having the same number of partitions on joined topics, and by using the serialization and Producer's default partitioner.here

    So in your case scenario-1 not possible whereas scenario-2 is possible.

    enter image description here