Search code examples
apache-kafkaapache-kafka-streams

Shared Kafka StateStore Best Practices


When creating Processor API Topology I noticed that Topology#addStateStore(StoreBuilder, String...) accepts multiple processors, meaning that one state store can be shared by multiple processors.

Are there any caveats with this design? Is it possible to lose data by storing value if a key is not present when in fact some other processor is just storing a value for this key? I guess I am asking whether usual race condition problems can arise.

Would it be any different if processors belong to different sub topologies?
Also, what happens when processors attached to sources with a different number of partitions share the same state store? How will this affect state store sharding?


Solution

  • There are no race conditions. If a single store is connected to multiple processors, both processors are executed in a single thread.

    However note, it's not defined in which order both processors would access the store, ie, if there is a single input record, you don't know which processor will be executed first.

    Would it be any different if processors belong to different sub topologies?

    That is not possible. If two processor access the same store, they will always be in the same sub-topology.

    Also, what happens when processors attached to sources with a different number of partitions share the same state store? How will this affect state store sharding?

    In general, that is not recommended, because your input data would not be co-partitioned (ie, records with the same key, are most likely in different partitions in the two topics). The program would still be executed using the larger partition count to create store shards. For some shards (for the higher partition numbers), the corresponding task would read data from only one topic, because there is no corresponding partition in the other topic.