how do I scale out Flink while sharing the same state?

The semantic of the workload is the following:

Flink operator reads events from the same Kafka topic. Each event needs to be processed by an expensive function f exactly once, ideally, if not at least once. There is correlation between events so each event should be processed based on the cumulative state (accumulated by events from initial state).

How can we scale horizontally for this use case in Flink? I want to process events concurrently but all event processing depends on the same state. In my use case the size of the state will first climb to and then fluctuate around one terabyte.

Solution

If your application needs to have a single, centralized data structure that is accessible to every event, then that application won't be horizontally scalable.

Flink approaches horizontal scaling by independently processing partitions of the data streams. This is usually done by computing a key from every event, and partitioning the stream around that key. State is maintained independently for each distinct key, and the limit to horizontal scaling is the number of distinct keys (the size of the key space). Rescaling is handled automatically, and is implemented by re-sharding the set of keys among the parallel instances.

Flink also supports non-keyed state, but the basic principle still applies: scaling can only be achieved by partitioning the stream, and maintaining state independently within each partition.