Search code examples
apache-flinkapache-stormflink-streaming

Apache Flink:keyby and window operator


I want to know some mechanisms related to keyedstream. The code is as follows:

DataStream<Tuple2<String, Integer>> counts =
            // split up the lines in pairs (2-tuples) containing: (word,1)
            text.flatMap(new Tokenizer())
            // group by the tuple field "0" and sum up tuple field "1"
                    .keyBy(0)
                    .window(TumblingProcessingTimeWindows.of(Time.seconds(3)))

If I want to implement window wordcount.

Q1:Is there only one key in each window or multiple keys?

Q2:For the functions in the window, I only use simple sum++ or need to handle the sum of multiple keys through the hashmap in the window like Apache Storm.

Thank you for your help.


Solution

  • Even if there are actually multiple keys per window, each call to your process/reduce/sum/aggregate function is made with elements with the same key.

    In your example you can just use sum and Flink will take care of everything:

    text.flatMap(new Tokenizer())
          .keyBy(0)
          .window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
          .sum(X)
    

    If you chose to go with a reduce instead...

    text.flatMap(new Tokenizer())
          .keyBy(0)
          .window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
          .reduce(new ReduceFunction<Tuple2<String, Integer>>(){
                @Override
                public Tuple2<String, Integer> reduce(final Tuple2<String, Integer> first, final Tuple2<String, Integer> second) {
                      (... do something with the guarantee that first[0] == second[0] (same key) ...)
                }
          });