apache-kafka, apache-kafka-streams

Stream re-partitioning on DSL topology with selectKey and transform


I feel like I am probably missing something very basic, but I'll ask anyway.

There is an input topic with multiple partitions. I'm using selectKey() as part of a DSL topology, and it always returns the same value. My expectation is that, after the internal re-partitioning triggered by selectKey(), the next processor in the topology would be called on the same partition for the same key. However, the next processor, transform(), is called on different partitions for the same key.

Topology:

    Topology buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();


        builder
            .stream("in-topic", Consumed.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
            .selectKey((k,v) -> "key")
            .transform(() -> new Processor())
            .print();

        return builder.build();
    }

The Processor class used by transform():

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class Processor implements Transformer<String, CatalogEvent, KeyValue<String, DispEvent>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, DispEvent> transform(String key, CatalogEvent catalogEvent) {
        System.out.println("key:" + key + " partition:" + context.partition());
        return null;
    }

    @Override
    public KeyValue<String, DispEvent> punctuate(long timestamp) {
        // TODO Auto-generated method stub
        return null;
    }

    @Override
    public void close() {
        // TODO Auto-generated method stub

    }
}

"in-topic" has two messages with random UUIDs as keys i.e. "8f45e552-8886-4781-bb0c-79ca98f9d927", "a794ed2a-6f7d-4522-a7ac-27c51a64fa28", the payload is the same for both messages

The output from Processor::transform for the two messages is:

key:key partition: 2
key:key partition: 0

How can I change the topology to make sure that messages with the same key arrive on the same partition? I need this to ensure that messages with the same key end up in the same local Kafka state store instance (for inserting or updating).


Solution

  • For process() and [flat]transform[Values]() there is no auto-repartitioning. You need to insert a manual repartition() call (or through() in older versions) to repartition the data; see the sketch below. If you compare their JavaDocs with those of groupBy() or join(), which do support auto-repartitioning, you will see that auto-repartitioning is not mentioned for them.

    The reason is that those three methods are part of the Processor API integration into the DSL and are thus not DSL operators. Their semantics are unknown to the DSL, so it cannot tell whether they require repartitioning when the key was changed. To avoid unnecessary repartitioning, no auto-repartitioning is performed for them.

    There is also a corresponding Jira: https://issues.apache.org/jira/browse/KAFKA-7608
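
    A minimal sketch of the adjusted topology, assuming Kafka Streams 2.6+ where repartition() is available (on older versions, a .through("repartition-topic") call before transform() has the same effect). JsonSerde, CatalogEvent and Processor are the classes from the question; note that on these newer versions the deprecated punctuate() override would also need to be removed from Processor:

        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.Topology;
        import org.apache.kafka.streams.kstream.Consumed;
        import org.apache.kafka.streams.kstream.Printed;
        import org.apache.kafka.streams.kstream.Repartitioned;

        Topology buildTopology() {
            final StreamsBuilder builder = new StreamsBuilder();

            builder
                .stream("in-topic", Consumed.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
                .selectKey((k, v) -> "key")
                // Manual repartition step: records are written to an internal topic
                // keyed by the new key and read back, so all records with the same
                // key land on the same partition (and thus the same task and local
                // state store) before transform() is called.
                .repartition(Repartitioned.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
                .transform(() -> new Processor())
                .print(Printed.toSysOut());

            return builder.build();
        }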