Search code examples
apache-flinkamazon-emrflink-streamingtaskmanager

What is Taskmanager, Task, Slots, Parallelism, CPU cores in Flink?


Can anyone please help me understand the meaning and difference between Task slots, parallelism and cpu cores in a Flink application?

Also, If I have an EMR cluster with 1 master node and 4 core nodes. Each core node is having 4 vCore, 8 GiB memory and EBS Storage:64 GiB. I have 7 flatmap function in my code. (I haven't changed any default configuration) I would like someone to help me understand how many task managers, parallelism, task and task slots are available for my job?


Solution

  • For definitions, see https://stackoverflow.com/a/53620443/2000823 and https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/glossary.html.

    To understand how your specific cluster is provisioned, the easiest way to go about that would be to look in its web interface. There you will find an overview like this

    Flink Web Interface and you will also find a list of taskmanagers and their resources, something like this

    Task Managers

    A task slot has the resources to run one parallel slice of your application; the total number of task slots is the same as the maximum parallelism of the cluster. It's common for each task manager to have one slot, and for each task slot to have one CPU core, but this can be configured differently; I don't know what the EMR default is.

    You should also examine the job graph, which will look something like this

    enter image description here

    to see what its topology looks like, and what sort of parallelism its operators require.