Search code examples
apache-flinkflink-streaming

What happens if total parallel instances of operators are higher than the parallelism Flink Application?


What happens if total parallel instances of operators are higher than parallelism of the flink system?

Here is the scenario:

  • Let's say I have a standalone flink application with 1 JobManager and 1 TaskManager(has 5 CPU)
  • I have setup the taskmanager.numberOfTaskSlots=5and parallelism.default=5
  • There are 2 data sources(assume that two different kafka topics which each of them five partitions)
  • Chaining strategy disabled for all operators
  • Dataflow of my application (I have only 1 job which includes both two kafka sources):
kafkaSource1.map(Mapper1).sink(sink1);
kafkaSource2.map(Mapper2).sink(sink1);

After deploying this dataflow with 5 parallelism, will TaskManager suffer from overload?

As far as my understanding, Tasks will be spreaded to the TaskManager's slots like this one:

slots

  • If this is correct diagram, in this diagram each slots have 2 different operators's instances. How it will work? It will work parallel or sequencial manner(first kafka1->map1->sink1, then kafka2->map2->sink1)
  • If it is not correct, how it will work, how task will be spreaded to the slots?

Solution

  • The diagram is correct. If you disable operator chaining, then each slot will contain 5 tasks, as shown. Each task will have a Java thread, which will sit blocked on the network until there is input to process. All of these tasks will run independently, in parallel.

    However, disabling operator chaining is a very bad idea. You will pay a large performance penalty for this, because it will cause serialization/deserialization to occur where it isn't needed. (Also, if the mappers are simply doing deserialization from Kafka, you will get better performance if you use an appropriate KafkaDeserializationSchema, and eliminate the mappers.)

    Will the task managers be overloaded? Probably not, provided you make good choices about operator chaining, etc. I would only be worried if the mappers are doing something unusually expensive. But it depends, in part, on the throughput you need to achieve.