apache-flink

Flink map function with parallelism > 1: how to guarantee the output order at the final sink


The simplified pipeline code is as follows:

env.addSource(kafkaConsumer)
    .map(func).setParallelism(2)
    .addSink(sink);

How can I make sure the output arrives at the sink in order?


Solution

To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured somewhere; the default parallelism is higher than one.)

Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
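
For concreteness, here is a minimal sketch of that assumed setup. The Kafka consumer is replaced by a placeholder fromElements source and print() stands in for the real sink; only the parallelism settings matter here.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AssumedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // every operator is non-parallel unless overridden below

        env.fromElements("a", "b", "c", "d")   // placeholder for the single-partition Kafka source
            .map(s -> s.toUpperCase())
            .setParallelism(2)                 // only the map runs in parallel
            .print()                           // non-parallel sink, standing in for the real one
            .setParallelism(1);

        env.execute("assumed-parallelism sketch");
    }
}

Running a job like this repeatedly can print the elements in different orders, which is exactly the race described next.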

With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.

If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel RichCoMap function between your map and sink operators.
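
As an illustration of that buffering idea, here is a sketch under some strong assumptions: each record is tagged with a gap-free, ascending sequence id before the parallel map, the id survives the map, and the operator below runs with parallelism 1 between the map and the sink. It uses a RichFlatMapFunction (a single-input alternative to the CoMap mentioned above), and the buffer lives only in operator memory, so it is not fault tolerant without moving it into Flink state.

import java.util.TreeMap;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Re-establishes the original order of (sequenceId, value) records.
public class ReorderingFunction extends RichFlatMapFunction<Tuple2<Long, String>, String> {

    // Records that arrived ahead of their turn, keyed by sequence id.
    private final TreeMap<Long, String> pending = new TreeMap<>();

    // Next sequence id to emit; assumes ids start at 0 and have no gaps.
    private long nextToEmit = 0L;

    @Override
    public void flatMap(Tuple2<Long, String> record, Collector<String> out) {
        pending.put(record.f0, record.f1);
        // Drain the buffer for as long as the next expected id is present.
        while (!pending.isEmpty() && pending.firstKey() == nextToEmit) {
            out.collect(pending.pollFirstEntry().getValue());
            nextToEmit++;
        }
    }
}

It would be wired in as something like mapped.flatMap(new ReorderingFunction()).setParallelism(1).addSink(...), where mapped is the output of the parallel map (the names here are hypothetical).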

If, on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.
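
The key fact is that Flink preserves the order of records exchanged between any single pair of sending and receiving subtasks. So if you key the stream by the same key both before the parallel map and again before the sink, every record of a given key flows through one map subtask and then one sink subtask, and per-key order is preserved even though different keys are processed in parallel. A sketch with placeholder types and a print() sink:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerKeyOrderingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Placeholder (key, value) source standing in for a keyed or partitioned Kafka topic.
        env.fromElements(
                Tuple2.of("a", 1L), Tuple2.of("b", 1L),
                Tuple2.of("a", 2L), Tuple2.of("b", 2L))
            .keyBy(record -> record.f0)        // all records of one key go to the same map subtask
            .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(Tuple2<String, Long> record) {
                    return Tuple2.of(record.f0, record.f1 * 10); // stand-in for the real transformation
                }
            })
            .setParallelism(2)                 // the parallel step
            .keyBy(record -> record.f0)        // re-key so each key reaches a single sink subtask
            .print()                           // per-key order here matches the source
            .setParallelism(2);

        env.execute("per-key ordering sketch");
    }
}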