Search code examples
multithreadingparallel-processingapache-flinkflink-streaming

Is one Task one thread in Apache Flink


I'm a newbie on Flink. As my understanding, in Flink, a TaskManager can be devided into more than one slots, one slot can be assigned more than one tasks and one task is one thread.

Let's see the example WordCount:

enter image description here

As my understanding, one task is exactly one thread, there are three tasks: Source + map(), keyBy()/window()/apply() and Sink. So each of them has its own threads, meaning that we need three threads for this example. We can put the three tasks (three threads) into one slot.

However, now I'm reading its official doc: https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html

A Flink program consists of multiple tasks (transformations/operators, data sources, and sinks). A task is split into several parallel instances for execution and each parallel instance processes a subset of the task’s input data. The number of parallel instances of a task is called its parallelism.

How to understand "A task is split into several parallel instances of execution"? Does "several parallel instances of execution" means multi threads? So one Task can be multi threads?

I'm confused now.


Solution

  • The wording is not perfect; task has sometimes different meanings in different contexts.

    In your example, you are showing the logical representation of a program with 3 tasks. Since it's a logical representation, it cannot be executed and hence thinking of threads does not make any sense.

    When executing such a logical representation, it gets translated into a physical representation. In the most simplest case, for each logical task N physical tasks are spawned, where N is the degree of parallelism of that task. To make it obvious, we started to call the physical tasks subtasks.

    You can roughly say that each subtask corresponds to one thread. However, in the case of operator chains, subtasks are merged to one chain and executed into one thread.

    So in your example, the number of threads is determined by the degree of parallelism of the three tasks. So you get N1+N2+N3 threads. If all tasks have the same degree of parallelism then it's 3*N.