apache-flink, flink-streaming

Applying Multiple Filters in Parallel Flink DataStream


I would like to run a Flink Streaming application that follows a read-once, write-many pattern. Basically, I would like to read from a firehose, apply different filters in parallel to each record read from the source, and send the results to different sinks based on configuration.

How can I do this in Flink? I believe Kafka Streams has a concept that supports this. Below is an illustration of how I want my Flink DAG to look:


Solution

  • The simplest way to accomplish this is with the filter() transformation, as demonstrated below:

    DataStream<X> stream = ...;
    
    // Filter 1
    stream
        .filter(x -> x.property == 1)
        .sinkTo(sink1);
    
    // Filter 2
    stream
        .filter(x -> x.property == 2)
        .sinkTo(sink2);
    
    // Repeat ad nauseam
    

    Alternatively, you could consider using Side Outputs, so that a single "filter" function handles separating your records into separate tagged outputs that you can then act upon independently.
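    The Side Outputs approach can be sketched roughly as follows. This is a minimal, untested sketch assuming a POJO `Event` with a public integer `property` field and pre-built `sink1`/`sink2` sinks (all hypothetical names); each record is routed once inside one ProcessFunction instead of being filtered n times:

    ```java
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // Tags identifying each side output; the anonymous subclasses
    // ({}) let Flink capture the element type information.
    final OutputTag<Event> tag1 = new OutputTag<Event>("property-1") {};
    final OutputTag<Event> tag2 = new OutputTag<Event>("property-2") {};

    SingleOutputStreamOperator<Event> routed = stream
        .process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event value, Context ctx, Collector<Event> out) {
                // Examine each record once and emit it to the matching side output.
                if (value.property == 1) {
                    ctx.output(tag1, value);
                } else if (value.property == 2) {
                    ctx.output(tag2, value);
                } else {
                    out.collect(value); // main output for everything else
                }
            }
        });

    // Each side output becomes its own stream, wired to its own sink.
    routed.getSideOutput(tag1).sinkTo(sink1);
    routed.getSideOutput(tag2).sinkTo(sink2);
    ```

    A nice property of this layout is that adding another destination only means adding one more tag and one more branch in the ProcessFunction, rather than another full pass over the source stream.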

    If you absolutely require n different streams, you might have to look at the type of sink you intend to write to and consider handling the filtering at that level.