apache-flink

Flink State backend keys atomicity and distribution


After reading the Flink docs (relevant part quoted below), I still don't completely understand the atomicity and key distribution.

For example, consider a graph consisting of keyBy -> flatMap (containing a MapState), with parallelism set to 1 and 4 task slots. Does Flink ensure that each key exists only once (in one task slot) in the distributed environment, and is the key the atomic unit? Thanks in advance to all helpers.

You can think of Keyed State as Operator State that has been partitioned, or sharded, with exactly one state-partition per key. Each keyed-state is logically bound to a unique composite of <parallel-operator-instance, key>, and since each key “belongs” to exactly one parallel instance of a keyed operator, we can think of this simply as <operator, key>.

Keyed State is further organized into so-called Key Groups. Key Groups are the atomic unit by which Flink can redistribute Keyed State; there are exactly as many Key Groups as the defined maximum parallelism. During execution each parallel instance of a keyed operator works with the keys for one or more Key Groups.


Solution

  • For any given parallel operator, all events with the same key are processed by the same operator instance -- i.e., in the same task slot.

    Flink organizes keys into key groups, and every key (and its state) is permanently associated with a specific key group. Furthermore, each task slot is responsible for processing the keys for one or more key groups.

    The documentation you've quoted uses the phrase "atomic unit" to mean "indivisible", and this becomes relevant when considering what happens when a Flink job is rescaled (i.e., when the parallelism is changed).

    When a Flink job is rescaled, the number of instances of a parallel operator will change, which requires redistributing state. The granularity at which this redistribution (or resharding) of the state is done is not key by key, but is instead larger -- it's done at the level of key groups. Thus key groups are the atomic unit of redistributing keyed state.

    For much more on this topic, see the section "State in Flink and Rescaling Stateful Streaming Jobs" of a data Artisans blog post.
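
To make the key -> key group -> operator instance mapping concrete, here is a minimal, self-contained sketch in plain Java (no Flink dependency). The class and method names (`KeyGroupSketch`, `keyGroupFor`, `operatorIndexFor`) are hypothetical and exist only for illustration. As far as I know, Flink additionally runs the key's `hashCode()` through a murmur hash before taking the modulo; this sketch skips that step, so the key group numbers it prints will not match what a real job computes. The point is only the shape of the mapping: a key's group depends on the key and the maximum parallelism, never on the current parallelism, so rescaling moves whole key groups rather than individual keys.

```java
public class KeyGroupSketch {

    // Simplified stand-in for key-group assignment (assumption: real Flink
    // also applies a murmur hash to key.hashCode() before the modulo).
    // The result depends only on the key and maxParallelism.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Maps a key group to the index of the parallel operator instance that
    // owns it. Each instance ends up with a contiguous range of key groups.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // number of key groups, fixed for the job
        String[] keys = {"alice", "bob", "carol", "dave"};

        // "Rescale" from 2 to 4 instances: the key group of each key stays
        // the same, only the owning instance of that key group may change.
        for (int parallelism : new int[] {2, 4}) {
            System.out.println("parallelism = " + parallelism);
            for (String key : keys) {
                int keyGroup = keyGroupFor(key, maxParallelism);
                int instance = operatorIndexFor(keyGroup, maxParallelism, parallelism);
                System.out.println("  key=" + key
                        + " keyGroup=" + keyGroup
                        + " instance=" + instance);
            }
        }
    }
}
```

With parallelism 1, as in the question, `operatorIndexFor` returns 0 for every key group, so a single operator instance owns all keys regardless of how many task slots are available.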