Search code examples
apache-flinkflink-streaming

Flink: what is the relationship between parallelism and slot nums


It seems that the number of slots allocated should be equal to the parallelism.
And the number of task managers should be equal to parallelism/(slot per TM).
But the application below is not like this.

The topology is as below. enter image description here

The parallelism is set to 140, and one slot per TM.
But only 115 slots are allocated. enter image description here

And the application throwed an exception after several minutes. enter image description here

The exception told that "Could not allocate all requires slots within timeout of 300000 ms. Slots required: 470, slots allocated: 388".

There are several questions here.

  1. "Slots required: 470", the calculation of this number seems to be regardless of slot share. Why?
  2. "slots allocated: 388", in fact, I only have 195 slots left, and the number of allocated slots are 115. Why?
  3. The parallelism is set to 140, but only 115 slots are allocated. Why?

Solution

  • With the help of workmates, I finally find out the key point.

    It is indeed that some resource has been exhausted.

    Though as a whole, there are still some free memories and cores. But on a machine's view, a part of the machines have some free memories, but no sufficient cores. And other machines have some free cores, but no sufficient memories.

    The situation is like the table below(We need 4G for one TM in our situation).

    enter image description here

    Thus, containers can be started on none of these machines. Because we cannot find a machine with both sufficient memories and cores. And the application hangs on CREATING.

    And about the message in the exception(which is thrown in version 1.6.4): "Could not allocate all requires slots within timeout of 300000 ms. Slots required: 470, slots allocated: 388". I think, in consideration of the number, 'slot' here actually means 'subtask', not 'core'.

    In version 1.11.0, the message in the exception changes, it says 'Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.' This description maybe more appropriate.