Search code examples
apache-flinkflink-streaming

How to access a same variable in different Flink operators


I have a collection, e.g. val m = ConcurrentMap(), normally I can use a method taking it as parameter and different threads can call the method passing the same m.

In flink it may be

val s = StreamExecutionEnvironment.getExecutionEnvironment
s.addSource(new MySource(m))
      .map(new MyMap(m))
      .addSink(new MySink(m))

These params would be serialized to different machines and seems that it could not be shared by different operators. I found that ColocationGroup maybe close to the solution. Is it right? How to do it?


Solution

  • There's no way to share an in-memory data structure between operators, or even between parallelized sub-tasks for the same operator, as each instance of an operator can be running in a separate JVM.

    Normally you'd figure out how to design your workflow to avoid needing to share data, as that will often lead to concurrency and scalability issues.

    You can use broadcast streams to ensure every sub-task of an operator gets the same data, if you can't use partitioning of the data to remove the requirement that every sub-task sees all the data.

    Worst case, you use some shared data store (Cassandra, HBase, etc) for this map of data, but almost always you can avoid that by redesigning your workflow.