apache-flink flink-streaming

Flink keyby/window operator task execution place and internals


I am new to Flink. I am writing a simple Flink POC program that produces the expected output, but I do not understand the internals of the keyBy and window operations. Here is my code:

environment
    .addSource(consumer)
    .name("MyKafkaSource")
    .setParallelism(2)
    .flatMap(pojoMapper)
    .name("MyPojoMapper")
    .setParallelism(2)
    .keyBy(new MyKeyExtractor())
    .timeWindow(Time.seconds(60))
    .apply(new SumFunction())
    .name("MySumFunction")
    .setParallelism(2)
    .print()
    .name("S3FileSink")
    .setParallelism(2);

When I deploy the Flink job, I get the following graph in the Flink UI:

[Image: Task Visualizer]

From the above image I understand that the job uses 2 tasks and 4 slots in total, each task with a parallelism of 2. The first task contains the source and the POJO mapper; the second task contains the sum function and the sink function.

Now my questions are:

  1. Where do the keyBy and window operations live: in the first task or in the second task? Why are they not visible in the above image? Is there any way to visualise them?

  2. Let's say that within 1 window (60-second interval) I receive 100 distinct keys, and each key receives 5 records during that minute. How many window objects are created internally for that interval? I am assuming that 100 window objects are created and that each window object holds 5 records. Is my assumption correct? If not, could anyone please explain what happens internally? Also, if possible, please share any related documentation.


Solution

  • Because they are connected by data forwarding connections, the source and flatmap operators are chained together into the same task, and the same applies to the window and sink. But since the flatmap and window are connected by a keyBy, a network shuffle is needed there.

    Thus your job runs as a total of 4 parallel task instances (subtasks, in Flink's terminology): 2 instances of source plus flatmap, and 2 instances of window plus sink. With Flink's default slot sharing, these 4 instances are deployed into 2 task slots, each of which holds one source/flatmap instance and one window/sink instance.
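
    The arithmetic behind those numbers can be sketched in plain Java. This is only an illustration, not Flink code: the class and method names are hypothetical, and it assumes default slot sharing, under which a job needs as many slots as its highest operator parallelism.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlotCountSketch {

    // The two chained tasks from the question, each with parallelism 2.
    static Map<String, Integer> exampleJob() {
        Map<String, Integer> taskParallelism = new LinkedHashMap<>();
        taskParallelism.put("MyKafkaSource -> MyPojoMapper", 2);
        taskParallelism.put("MySumFunction -> S3FileSink", 2);
        return taskParallelism;
    }

    // Total number of parallel instances: the sum of each task's parallelism.
    static int totalInstances(Map<String, Integer> taskParallelism) {
        int total = 0;
        for (int parallelism : taskParallelism.values()) {
            total += parallelism;
        }
        return total;
    }

    // Assumption: default slot sharing, so one slot can hold one instance
    // of every task and the job needs max(parallelism) slots.
    static int requiredSlots(Map<String, Integer> taskParallelism) {
        int max = 0;
        for (int parallelism : taskParallelism.values()) {
            max = Math.max(max, parallelism);
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, Integer> job = exampleJob();
        System.out.println("parallel instances: " + totalInstances(job)); // prints 4
        System.out.println("required slots: " + requiredSlots(job));      // prints 2
    }
}
```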

    keyBy is depicted where it says HASH on the diagram. keyBy isn't an operator, but is instead a description of how the operators before and after the keyBy are connected.
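
    As a rough illustration of what HASH partitioning means here, the plain-Java sketch below routes each key to one of the 2 downstream window instances. The class and method names are hypothetical, and the modulo on hashCode is a deliberate simplification: Flink itself first maps keys to key groups with its own hash function and then maps key groups to parallel instances. The property that matters is the same in both cases: every record with a given key reaches the same downstream instance.

```java
public class KeyByRoutingSketch {

    // Hypothetical helper: choose the downstream instance for a key.
    // Simplified stand-in for Flink's key-group based assignment.
    static int targetInstance(Object key, int parallelism) {
        // floorMod keeps the result in [0, parallelism) even for negative hash codes
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 2; // parallelism of the window operator in the question
        String[] keys = {"user-1", "user-2", "user-3", "user-1"};
        for (String key : keys) {
            // Repeated keys (here "user-1") always map to the same instance
            System.out.println(key + " -> window instance " + targetInstance(key, parallelism));
        }
    }
}
```

    Because the target depends only on the key, records from either source/flatmap instance may have to cross the network to reach the right window instance, which is why the chain is broken at the keyBy.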

    These two lines of code

    .timeWindow(Time.seconds(60))
    .apply(new SumFunction())
    

    together describe the window operator, which is shown on the diagram as MySumFunction. The window is in the second task.

    You are correct in assuming that there is a window for each distinct key, and each of those 100 windows contains 5 records.
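
    The bookkeeping can be pictured with a small plain-Java simulation. It is a sketch under stated assumptions, not Flink's implementation: the class, the method names, and the string identifier for a (key, window start) pair are all hypothetical. It assigns 5 records for each of 100 keys to 60-second tumbling windows and keeps one buffer per key and window, much as a window function passed to apply() receives all of a key's buffered elements when the window fires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedWindowSketch {

    static final long WINDOW_SIZE_MS = 60_000L; // matches Time.seconds(60)

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // Buffers record timestamps per (key, window start) pair.
    static Map<String, List<Long>> assign(int distinctKeys, int recordsPerKey) {
        Map<String, List<Long>> buffers = new HashMap<>();
        for (int k = 0; k < distinctKeys; k++) {
            for (int r = 0; r < recordsPerKey; r++) {
                long timestampMs = r * 10_000L; // one record every 10 seconds
                String bufferId = "key-" + k + "@" + windowStart(timestampMs);
                buffers.computeIfAbsent(bufferId, id -> new ArrayList<>()).add(timestampMs);
            }
        }
        return buffers;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> buffers = assign(100, 5);
        System.out.println("buffers: " + buffers.size());                           // prints "buffers: 100"
        System.out.println("records for key-0: " + buffers.get("key-0@0").size());  // prints "records for key-0: 5"
    }
}
```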

    As resources for learning more about Flink, I can recommend the Apache Flink Training, the book Stream Processing with Apache Flink, and searching for Flink Forward talks on YouTube.