I have a pipeline that consumes data with the following shape:
case class Foo(source: String, destination: String) { def key = source + destination }
I want to remove all source+destination duplicates that arrive in the same hour, and then count all calls that arrive for a destination in the same hour. I created a pipeline of the form src ~> timewindow1(1 hour, keyBy: key) ~> timewindow2(1 hour, keyBy: destination) ~> ...
Should I use timeWindowAll in timewindow2?
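You should only use timeWindowAll when you don't want key-partitioned windowing. It puts every element into a single non-keyed window, so you would get one count across all destinations rather than one count per destination. Since you are keying by destination, you should use timeWindow, not timeWindowAll.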
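
As a rough illustration of the two keyed stages, here is a minimal sketch. It assumes the Flink Scala DataStream API from the releases where timeWindow is available (newer releases deprecate it in favor of window(...) with an explicit window assigner). The names CallCounts and countPerDestination are made up for this example, and keeping the first element of each window is just one possible way to drop duplicates.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Foo(source: String, destination: String) {
  def key: String = source + destination
}

// Hypothetical helper object; not part of Flink or of the original pipeline.
object CallCounts {

  // Returns (destination, number of distinct source+destination pairs) per hour.
  def countPerDestination(calls: DataStream[Foo]): DataStream[(String, Int)] =
    calls
      // Stage 1: keyed by source+destination, one-hour tumbling window.
      // The reduce keeps only the first element per key, dropping duplicates.
      .keyBy(_.key)
      .timeWindow(Time.hours(1))
      .reduce((first, _) => first)
      // Stage 2: keyed by destination, so timeWindow (not timeWindowAll).
      .map(foo => (foo.destination, 1))
      .keyBy(_._1)
      .timeWindow(Time.hours(1))
      // Sum tuple field 1 (the constant 1) to get the count per destination.
      .sum(1)
}
```

Both stages are keyed, so both use timeWindow. timeWindowAll would only fit if you wanted a single total for the whole stream, and it runs that window without key-based parallelism.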