Apache Flink uses a DAG-style lazy processing model, similar to Apache Spark (correct me if I'm wrong). That said, if I use the following code:
DataStream<Element> data = ...;
DataStream<Element> res = data.filter(...).keyBy(...).timeWindow(...).apply(...);
.keyBy() converts the DataStream to a KeyedStream and distributes it among the Flink worker nodes. My question is: how will Flink handle the filter here? Will the filter be applied to the incoming DataStream before partitioning/distributing the stream, so that the KeyedStream is only built from Elements that pass the filter criteria?
Will the filter be applied to the incoming DataStream before partitioning/distributing the stream, and will the DataStream only contain Elements that pass the filter criteria?
Yes, that's right. The only thing I might say differently is to clarify that the original stream data will typically already be distributed (parallel) from the source. The filter will be applied in parallel, across multiple tasks, after which the keyBy will repartition/redistribute the stream among the workers.
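To make the ordering concrete, here is a minimal plain-Java sketch (not the Flink API) that models the behavior: each parallel source task applies the filter locally, and only the surviving records are routed by key hash, which is roughly what the keyBy shuffle does. The class and method names (Element, applyFilter, targetTask) are hypothetical, chosen only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterBeforeKeyBy {
    // Stand-in for the Element type in the question.
    public record Element(String key, int value) {}

    // keyBy-style routing: a hash of the key picks the downstream parallel task,
    // so all records with the same key land on the same task.
    public static int targetTask(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // The filter runs where the data already is, before any network shuffle.
    public static List<Element> applyFilter(List<Element> in) {
        return in.stream().filter(e -> e.value() > 0).collect(Collectors.toList());
    }

    // Only records that survived the filter are repartitioned by key.
    public static Map<Integer, List<Element>> partitionByKey(List<Element> in, int parallelism) {
        return in.stream()
                 .collect(Collectors.groupingBy(e -> targetTask(e.key(), parallelism)));
    }

    public static void main(String[] args) {
        List<Element> source = List.of(
            new Element("a", 5), new Element("b", -1),
            new Element("a", 3), new Element("c", -7));

        List<Element> survived = applyFilter(source);              // 2 of 4 pass
        Map<Integer, List<Element>> routed = partitionByKey(survived, 2);

        System.out.println(survived.size() + " of " + source.size() + " records shuffled");
        System.out.println(routed);
    }
}
```

The point of the sketch is the order of operations: because the filter precedes the shuffle, records that fail the predicate are dropped before they cost any network traffic, which is why Flink (like most dataflow engines) keeps the filter upstream of the keyBy exchange.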
You can use Flink's web UI to examine a visualization of the execution graph produced from your job.