Search code examples
apache-flinkglobal-object

Flink shared object across jobs


I have a list of streaming jobs running in Flink, each job has different parallelisms.

The jobs process documents. Now I want to make a snapshot of the document that each job last processed. I can use Redis to store the document ID and the Kafka topic. But I don't want to open Redis connections for the total number of each job's parallelisms (e.g. 300 connections for 30 jobs of 10 parallelisms each)

My questions are:

  1. Is it better that I put the redis in the sinking part not the RichFlatMapFunction? Do I need to only open 30 connections for 30 jobs, regardless of the parallelisms?

  2. Is there even a better way to make the Object, that writes to Redis, shared across the jobs, so that only one connection is needed?


Solution

  • Create a side output with the document ID/time of processing, and run that into an operator with a parallelism of 1. Store this data wherever makes sense (Redis, etc)