Search code examples
apache-flinkflink-streaming

How to cache a concurrent hashmap at the operator level in Flink streaming?


Probably similar to the question posted here, but my goal is to share a concurrent hashmap across all parallelisms in a single flat map operator.

I have a hashmap that contains a <Integer, Integer> mapping, and I want this map to be shared among all the task slots that run my flatmap function.

The reason that I want it shared is because my flatmap is keyed via 2 values in a Tuple2, the first value being the key I care about and the second value being a secondary key as I want each event with the two key combo to end up in one worker and aggregated. Therefore, the first key may end up in multiple worker threads, so I want a way to share a gigantic hashmap across all of them like a cache of some sort.

Is this possible with Flink? Is it even a good idea? Thanks.


Solution

  • As soon as you add the second tuple field to the key, you're not guaranteed that each key <A, x> will get sent to a slot that's in the same TaskManager instance. So unless you're running with a single TM, this won't work.