I am trying to understand the architecture of the Kafka Streams API and came across this in the documentation:
An application's processor topology is scaled by breaking it into multiple tasks
What are the criteria for breaking up the processor topology into tasks? Is it just the number of partitions in the stream/topic, or something more?
The documentation also says:

Tasks can then instantiate their own processor topology based on the assigned partitions
Can someone explain what the above means with an example? If tasks are created only for the purpose of scaling, shouldn't they all have the same topology?
Tasks are atomic parallel units of processing.
A topology is divided into sub-topologies (sub-topologies are "connected components" that forward data in-memory; different sub-topologies are connected only via topics). For each sub-topology, the number of partitions of its input topics determines the number of tasks that are created. If a sub-topology has multiple input topics, the maximum number of partitions over all of its input topics determines the number of tasks.
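As an illustration, here is a minimal sketch (topic names and partition counts are made up) of a topology that splits into two sub-topologies, because the two parts are connected only through a topic:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class SubTopologyExample {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        // Sub-topology 0: reads "orders", transforms records in-memory,
        // and writes the result to the intermediate topic "orders-enriched".
        final KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase()).to("orders-enriched");

        // Sub-topology 1: reads "orders-enriched" and "payments".
        // It is a separate sub-topology because it is connected to the
        // part above only via the topic "orders-enriched", not in-memory.
        final KStream<String, String> enriched = builder.stream("orders-enriched");
        final KStream<String, String> payments = builder.stream("payments");
        enriched.merge(payments).to("output");

        final Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
```

If "orders" has 4 partitions, sub-topology 0 runs as 4 tasks; if "orders-enriched" has 4 partitions and "payments" has 2, sub-topology 1 runs as max(4, 2) = 4 tasks. All tasks of the same sub-topology instantiate the same processor topology, each over its own assigned partitions.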
If you want to know the sub-topologies of your Kafka Streams application, you can call Topology#describe(): the returned TopologyDescription can either be printed via toString(), or you can traverse its sub-topologies and their corresponding DAGs.
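For example, a small sketch of such a traversal (continuing from the topology built in the snippet above):

```java
import org.apache.kafka.streams.TopologyDescription;

final TopologyDescription description = topology.describe();
for (final TopologyDescription.Subtopology subtopology : description.subtopologies()) {
    System.out.println("Sub-topology " + subtopology.id());
    // Each node is a source, processor, or sink of this sub-topology's DAG.
    for (final TopologyDescription.Node node : subtopology.nodes()) {
        System.out.println("  " + node.name() + " -> " + node.successors());
    }
}
```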
A Kafka Streams application has one topology that may have one or more sub-topologies. You can find a topology with 2 sub-topologies in the article Data Reprocessing with the Streams API in Kafka: Resetting a Streams Application.