apache-flink, flink-streaming

Apache Flink: Using filter() or split() to split a stream?


I have a DataStream from Kafka in which one field of MyModel can take two possible values. MyModel is a POJO with domain-specific fields parsed from a Kafka message.

DataStream<MyModel> stream = env.addSource(myKafkaConsumer);

I want to apply windows and operators to each key, a1 and a2, separately. What is a good way to separate them? I have two options in mind, filter and split/select, but I don't know which one is faster.

Filter approach

stream
        .filter(<MyModel.a == a1>)
        .keyBy()
        .window()
        .apply()
        .addSink()

stream
        .filter(<MyModel.a == a2>)
        .keyBy()
        .window()
        .apply()
        .addSink()

Split and select approach

SplitStream<MyModel> split = stream.split(…);

split
        .select(<MyModel.a == a1>)
        …
        .addSink()

split
        .select(<MyModel.a == a2>)
        …
        .addSink()

If split and select are better, how do I implement them when I want to split based on the value of a field in MyModel?


Solution

  • Both methods behave pretty much the same. Internally, the split() operator forks the stream and applies filters as well.

    There is a third option, side outputs. Side outputs can have some additional benefits, such as allowing each output to carry a different data type. Moreover, with side outputs the filter condition is evaluated only once per record.
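
To make the "evaluated once" point concrete, here is a minimal plain-Java sketch that counts predicate evaluations for the two styles. It deliberately does not use the Flink API: `RoutingSketch`, the nested `MyModel` stand-in, and the literal values `"a1"` and `"a2"` are assumptions made for illustration. In Flink itself, the single-pass variant corresponds to a `ProcessFunction` that emits to an `OutputTag` via `ctx.output(...)`, with the second stream read back through `getSideOutput(...)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class RoutingSketch {
    // Hypothetical stand-in for the question's MyModel; only field `a` matters here.
    static final class MyModel {
        final String a;
        MyModel(String a) { this.a = a; }
    }

    // Wraps a predicate so that every evaluation increments the counter.
    static <T> Predicate<T> counting(Predicate<T> p, AtomicInteger counter) {
        return t -> {
            counter.incrementAndGet();
            return p.test(t);
        };
    }

    // Filter style: two independent branches, each testing every record.
    static int evaluationsWithTwoFilters(List<MyModel> records) {
        AtomicInteger evals = new AtomicInteger();
        Predicate<MyModel> isA1 = counting(m -> "a1".equals(m.a), evals);
        Predicate<MyModel> isA2 = counting(m -> "a2".equals(m.a), evals);
        List<MyModel> branch1 = new ArrayList<>();
        List<MyModel> branch2 = new ArrayList<>();
        for (MyModel m : records) {
            if (isA1.test(m)) branch1.add(m);
            if (isA2.test(m)) branch2.add(m);
        }
        return evals.get();
    }

    // Side-output style: one test per record decides which output receives it.
    // Assumes, as in the question, that `a` has only two possible values.
    static int evaluationsWithSingleRouting(List<MyModel> records) {
        AtomicInteger evals = new AtomicInteger();
        Predicate<MyModel> isA1 = counting(m -> "a1".equals(m.a), evals);
        List<MyModel> mainOutput = new ArrayList<>();
        List<MyModel> sideOutput = new ArrayList<>();
        for (MyModel m : records) {
            if (isA1.test(m)) mainOutput.add(m);
            else sideOutput.add(m);
        }
        return evals.get();
    }

    public static void main(String[] args) {
        List<MyModel> records = Arrays.asList(
                new MyModel("a1"), new MyModel("a2"),
                new MyModel("a1"), new MyModel("a2"));
        System.out.println(evaluationsWithTwoFilters(records));    // prints 8
        System.out.println(evaluationsWithSingleRouting(records)); // prints 4
    }
}
```

For a cheap field comparison like this one the difference is unlikely to matter much, which fits the point above that filter and split behave about the same; the count only becomes relevant when the condition itself is expensive.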