Search code examples
apache-flinkflink-streaming

Multiple jobs or multiple pipelines in one job in Flink


I have a use case where I want to run 2 independent processing flows on Flink. So 2 flows would look like

Source1 -> operator1 -> Sink1

Source2 -> operator2 -> Sink2

I want to re-use the same Flink cluster for both flows. I can think of doing this in 2 ways:

1) submit 2 different jobs on the same Flink application

2) Setup 2 pipelines in same job

I was able to setup the first option, but not sure how to do the second option. Has anyone tried such a setup before? What is the advantage of one over the other?


Solution

  • You can simply create multiple pipelines (with separate or shared source consumers) in your setupJob() method. Here is an example:

    private void buildPipeline(StreamExecutionEnvironment env, String sourceName, String sinkName) {
        DataStream<T> stream = env
                .addSource(getInputs().get(sourceName))
                .name(sourceName);
        stream = stream.filter(evt -> filter());
        ....
    }
    
    @Override
    public void setupJob(AthenaFlinkJobConfiguration jobConfig, StreamExecutionEnvironment env) throws Exception {
        ...
        buildPipeline(env, sourceTopic1, sink1, ...);
        buildPipeline(env, sourceTopic2, sink2, ...);
        ...
    }
    

    Here is a quick contrast of both approaches. The Pros/Cons of using separate jobs:

    • [+] Code is simpler.
    • [+] Greater flexibility to set low-level configuration (fault tolerance mechanism, heap size, parallelism, etc.)
    • [-] Higher infrastructure costs since resources are not shared.
    • [-] Maintenance & monitoring are more complex and time consuming.

    The benefits of using separate pipeline in a single job:

    • [+] Monitoring and debugging a single job is easier.
    • [+] Hotfixes are committed into a single repo and deployed to a single environment.
    • [+] Economical: decreases infrastructure hardware and operational costs.
    • [-] Can't bound single pipeline usage.
    • [-] Failures in one pipeline with affect the other pipeline.
    • [-] Back-pressure in one pipeline could affect the whole job since a single checkpoint is snapshotted per job.