I want to know some mechanisms related to keyedstream. The code is as follows:
DataStream<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
If I want to implement window wordcount.
Q1:Is there only one key in each window or multiple keys?
Q2:For the functions in the window, I only use simple sum++ or need to handle the sum of multiple keys through the hashmap in the window like Apache Storm.
Thank you for your help.
Even if there are actually multiple keys per window, each call to your process
/reduce
/sum
/aggregate
function is made with elements with the same key.
In your example you can just use sum
and Flink will take care of everything:
text.flatMap(new Tokenizer())
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
.sum(X)
If you chose to go with a reduce
instead...
text.flatMap(new Tokenizer())
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(3)))
.reduce(new ReduceFunction<Tuple2<String, Integer>>(){
@Override
public Tuple2<String, Integer> reduce(final Tuple2<String, Integer> first, final Tuple2<String, Integer> second) {
(... do something with the guarantee that first[0] == second[0] (same key) ...)
}
});