Search code examples
javaapache-flinkflink-streaming

How does Flink countWindow work in detail


I'm learning Flink with simple toy examples.

I have adapted the WindowWordCount example from here and run it on this simple data file

cat data.txt
a a b c c

I ignored the slideSize and only do countWindow(windowSize) (hence it's called tumbling window according to this post for multiple window sizes. I'm confused by the results.

When windowSize = 1, the output is

(a,1)
(a,1)
(b,1)
(c,1)
(c,1)

which makes total sense. But when windowSize = 2, the output is

(a,2)
(c,2)

where is count for b?

when windowSize = 3, the output is empty, which I don't understand either.

Can anyone help me understand how the outputs are produced in the cases of windowSize = 2 and windowSize = 3, please?

My code is on https://gist.github.com/zyxue/bc566180d2b01f1e2e77c1bbe3a7c5e5

And I run it with command like

./bin/flink run /path/to/target/myflinkapp-1.0-SNAPSHOT.jar --input data.txt  --output /tmp/out --window 1

Solution

  • The Trigger for CountWindow only triggers the window function for complete windows -- in other words, after processing windowSize events for a given key, then the window will fire.

    For example, with windowSize = 2, only for a and c are there enough events. Since there is only one b, the job ends with a partially filled window for b.

    You can use a custom trigger that also reacts to a timeout if you want to generate reports for partial count windows.