apache-kafka, apache-flink, flink-streaming

Flink custom partitioner example


The use case that I am trying to tackle is as follows:

  • We have a data stream flowing in from Kafka
  • We would like to guarantee that messages/records containing the same value for a particular entity get processed by the same operator.
  • We would like to maintain state on this Operator so that we are able to enrich future messages.

So, for example:

  • Let's assume that all messages are byte arrays representing encoded data.
  • All messages which have a particular value within the encoded data should be processed by a single operator.
  • This is so that when we receive certain special messages that also correspond to the same value, these can be stored as state on the Operator (after the partitioner) and used to enrich subsequent messages.

Questions:

  1. Will a custom partitioner help with this?
  2. If not, what would be a good solution for this?
  3. Can someone share an example of a custom partitioner in Flink for a DataStream? I was not able to find any complete examples.

Solution

  • A custom partitioner would help, but it is not necessary for your case.

    You can just extract the grouping value from your messages and use it as the grouping key. Thus, after the sources read the data, you use a map to extract the value (e.g., Record -> (groupingValue, Record), with data types byte[] -> Tuple2<keyType, byte[]> if you want to keep the raw message). Afterwards, you can use .keyBy(0) and apply whatever operator you want on it. keyBy ensures that all records with the same value in the first field of the Tuple2 are processed by the same operator.
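
    For reference, below is a minimal sketch of that map + keyBy pipeline using the Java DataStream API. The fromElements source and the key extraction (splitting the payload on ':') are placeholders standing in for your Kafka consumer and your actual decoding logic:

        import org.apache.flink.api.common.functions.MapFunction;
        import org.apache.flink.api.java.tuple.Tuple2;
        import org.apache.flink.streaming.api.datastream.DataStream;
        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

        public class KeyByGroupingJob {
            public static void main(String[] args) throws Exception {
                StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

                // Placeholder source; in the real job this would be the Kafka consumer source.
                DataStream<byte[]> rawStream = env.fromElements(
                        "deviceA:1".getBytes(), "deviceB:2".getBytes(), "deviceA:3".getBytes());

                rawStream
                        // byte[] -> Tuple2<groupingValue, raw record>, keeping the raw message
                        .map(new MapFunction<byte[], Tuple2<String, byte[]>>() {
                            @Override
                            public Tuple2<String, byte[]> map(byte[] record) {
                                // hypothetical decoding: the key is whatever identifies the
                                // entity inside your encoded payload (here, the part before ':')
                                String key = new String(record).split(":")[0];
                                return Tuple2.of(key, record);
                            }
                        })
                        // all tuples with the same value in field 0 reach the same operator
                        // instance, so a stateful function (e.g. a RichFlatMapFunction with
                        // keyed state) applied after this point can hold enrichment data per key
                        .keyBy(0)
                        .print();

                env.execute("keyBy grouping example");
            }
        }

    (keyBy(0) matches the field-index style used above; newer Flink versions prefer passing a KeySelector instead.)

    If you do want an explicit custom partitioner (question 3), a partitionCustom sketch could look like the following; it assumes the same rawStream and key-extraction convention as above. Note that partitionCustom only controls how records are physically distributed and does not produce a KeyedStream, so Flink's keyed state is not available downstream of it; for the state-based enrichment described in the question, keyBy is the better fit.

        import org.apache.flink.api.common.functions.Partitioner;
        import org.apache.flink.api.java.functions.KeySelector;

        DataStream<byte[]> partitioned = rawStream.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        // decide which parallel channel receives records with this key
                        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
                    }
                },
                new KeySelector<byte[], String>() {
                    @Override
                    public String getKey(byte[] record) {
                        return new String(record).split(":")[0];
                    }
                });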