Search code examples
apache-flinkflink-streaming

Flink re-scalable keyed stream stateful function


I have the following Flink job where I tried to use keyed-stream stateful function (MapState) with backend type RockDB,

environment
.addSource(consumer).name("MyKafkaSource").uid("kafka-id")
.flatMap(pojoMapper).name("MyMapFunction").uid("map-id")
.keyBy(new MyKeyExtractor())
.map(new MyRichMapFunction()).name("MyRichMapFunction").uid("rich-map-id")
.addSink(sink).name("MyFileSink").uid("sink-id")

MyRichMapFunction is a stateful function which extends RichMapFunction which has following code,

public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {
    private transient MapState<String, Boolean> cache;
    @Override
    public void open(Configuration config) {
        MapStateDescriptor<String, Boolean> descriptor =
                new MapStateDescriptor("seen-values", TypeInformation.of(new TypeHint<String>() {}), TypeInformation.of(new TypeHint<Boolean>() {}));
        cache = getRuntimeContext().getMapState(descriptor);
    }
    @Override
    public MyEvent map(MyEvent value) throws Exception {
        if (cache.contains(value.getEventId())) {
            value.setIsSeenAlready(Boolean.TRUE);
            return value;
        }
        value.setIsSeenAlready(Boolean.FALSE);
        cache.put(value.getEventId(), Boolean.TRUE)
        return value;
    }
}

In future, I would like to rescale the parallelism (from 2 to 4), so my question is, how can I achieve re-scalable keyed states so that after changing the parallelism I can get the corresponding cache keyed data to its corresponding task slot. I tried to explore this, where I found a documentation here. According to this, re-scalable operator state can be achieved by using ListCheckPointed interface which provides snapshotState/restoreState method for that. But not sure how re-scalable keyed state (MyRichMapFunction) can be achieved? Should I need to implement ListCheckPointed interface for my MyRichMapFunction class? If yes how can I redistribute the cache according to new parallelism key hash on restoreState method (my MapState will hold huge number of keys with TTL enabled, let's say max it will hold 1 billion keys at any point of time)? Could some one please help me on this or if you point me to any example that would be great too.


Solution

  • The code you've written is already rescalable; Flink's managed keyed state is rescalable by design. Keyed state is rescaled by rebalancing the assignment of keys to instances. (You can think of keyed state as a sharded key/value store. Technically what happens is that consistent hashing is used to map keys to key groups, and each parallel instance is responsible for some of the key groups. Rescaling simply involves redistributing the key groups among the instances.)

    The ListCheckpointed interface is for state used in a non-keyed context, so it's inappropriate for what you are doing. Note also that ListCheckpointed will be deprecated in Flink 1.11 in favor of the more general CheckpointedFunction.

    One more thing: if MyKeyExtractor is keying by value.getEventId(), then you could be using ValueState<Boolean> for your cache, rather than MapState<String, Boolean>. This works because with keyed state there is a separate value of ValueState for every key. You only need to use MapState when you need to store multiple attribute/value pairs for each key in your stream.

    Most of this is discussed in the Flink documentation under Hands-on Training, which includes an example that's very close to what you are doing.