Search code examples
apache-flinkflink-streaming

In-memory, replicated cache for Apache Flink tasks


Im processing a stream that needs to be transformed based on an inventory data. Inherently, the inventory data will be big (dont have exact figure) and will have A LOT of reads and very few writes.

Let's assume I have 3 task managers with 1 slot for each.

How can my transform operator running on 3 different machines use an in-memory, replicated cache?


Solution

  • Normally (with reasonable record size for the enrichment data) you can easily save 100s of K of unique records in state. Since you're joining enrichment data to your stream based on IDs, just use a KeyedCoProcessFunction.

    Often the main challenge here is cold-start, where you want to first make sure your state (the enrichment data) is completely loaded before you start processing the data to be enriched. One simple solution is to first run your workflow in a special mode, where you have a dummy source for the streaming data. Once all of the enrichment data has been loaded, stop it with a savepoint, then restart in normal mode from that savepoint.