Search code examples
architectureapache-flinkflink-streaming

How to understand the function setParallelism in Apache Flink


https://ci.apache.org/projects/flink/flink-docs-stable/dev/parallel.html

I'm reading this doc of Flink and I can't quite understand the part of Execution Environment Level well.

Let's use the example of WordCount.

enter image description here

So if I code env.setParallelism(3); in this example, does it mean that I will have three parallel pipelines of Source + map() --- keyBy()/window()/apply() --- Sink? What makes me confused is if I have three Sinks, how could I get the result correctly?

If there is only one Sink, I think there won't be any issues. I mean no matter how many Source + map() I have, the only Sink can produce one result. But now I have three Sinks...

// Case 1
Source + map() --- keyBy()/window()/apply() ----\
Source + map() --- keyBy()/window()/apply() --- Sink (the only Sink will merge the outputs coming from three pipelines and produce only one result)
Source + map() --- keyBy()/window()/apply() ----/

// Case 2
Source + map() --- keyBy()/window()/apply() --- Sink
Source + map() --- keyBy()/window()/apply() --- Sink
Source + map() --- keyBy()/window()/apply() --- Sink
// There are three sinks, how could I get the result?

So we shouldn't use setParallelism() in this example or I misunderstood something?


Solution

  • There's nothing inherently wrong with having a parallel sink. For example, different instances of the Kafka sink will write to different partitions. The StreamingFileSink will write in parallel to different buckets, the various database connectors can be updating or inserting records for different keys, etc. Scalable stream processing requires that all parts of the pipeline be able to scale, including the sinks.

    In a case like the pipeline you describe, the window and sink can be chained together. If the sink is a print sink and it is used in parallel, then each task manager will write its slice of the results into a local output file. Obviously this isn't very convenient if your objective is to have all of the results together in one place, in which case you will want to set the parallelism of the sink operator to 1. But many applications don't have such a requirement.