apache-flink

Data/event exchange between jobs


Is it possible in Apache Flink to create an application that consists of multiple jobs which build a pipeline to process some data?

For example, consider a process with an input/preprocessing stage, a business-logic stage, and an output stage. In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.

Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)? If yes, where can I find documentation about this, and can it buffer data if one of the jobs is restarted? If no, does anyone have experience with such a setup who can point me to a possible solution?

Thank you!


Solution

  • An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.

    I've seen this done with purpose-built DSLs, PMML models, JavaScript (via Rhino), Groovy, Java classloading, ...

    You can use a broadcast stream to communicate/update the dynamic portions of the processing.

    Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
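    To make the broadcast-stream idea above concrete, here is a minimal sketch using Flink's DataStream API. It assumes a non-keyed stream and Flink's `BroadcastProcessFunction`; the sources (`socketTextStream` on ports 9999/9998), the class name `DynamicLogicJob`, and the string-based "rule" representation are illustrative placeholders, not anything from the talk. It requires the Flink dependencies on the classpath to compile and run.

    ```java
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class DynamicLogicJob {

        // Descriptor for the broadcast state that holds the current rule/logic.
        private static final MapStateDescriptor<String, String> RULES =
            new MapStateDescriptor<>("rules",
                BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Main data stream and a second stream carrying logic updates
            // (placeholder sources for illustration).
            DataStream<String> events = env.socketTextStream("localhost", 9999);
            DataStream<String> ruleUpdates = env.socketTextStream("localhost", 9998);

            // Broadcast the rule updates to every parallel instance of the operator.
            BroadcastStream<String> broadcastRules = ruleUpdates.broadcast(RULES);

            events.connect(broadcastRules)
                  .process(new BroadcastProcessFunction<String, String, String>() {

                      @Override
                      public void processElement(String event, ReadOnlyContext ctx,
                                                 Collector<String> out) throws Exception {
                          // Apply the most recently broadcast rule to each event.
                          String rule = ctx.getBroadcastState(RULES).get("current");
                          out.collect(rule == null ? event : apply(rule, event));
                      }

                      @Override
                      public void processBroadcastElement(String rule, Context ctx,
                                                          Collector<String> out) throws Exception {
                          // Update the processing logic; the topology stays static.
                          ctx.getBroadcastState(RULES).put("current", rule);
                      }

                      private String apply(String rule, String event) {
                          // Placeholder: in practice this would evaluate a DSL,
                          // a PMML model, a script, or a dynamically loaded class.
                          return rule + ":" + event;
                      }
                  })
                  .print();

            env.execute("dynamic-logic-example");
        }
    }
    ```

    The key point is that the job graph never changes: only the contents of the broadcast state do, so new logic can be rolled out while the job keeps running.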