apache-kafka-streams

KTable with billions of unique keys


My requirement is to build a real-time aggregation pipeline using Kafka Streams with a large volume of data. Based on current estimates, there will be ~3 to 4 billion unique keys and a total message size of ~5 TB.

The high-level architecture is: read from a Kafka topic, aggregate based on certain key columns, and publish the aggregated results to a KTable (a Kafka compacted topic). The KTable is used to read the previous state for each key and update it with the new aggregated results.
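A simplified sketch of what I have in mind (topic names, serdes, and the running-sum aggregation are placeholders for the actual logic):

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class AggregationPipeline {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Read raw events, group them by the aggregation key, and keep a running aggregate per key.
            KTable<String, Long> aggregated = builder
                .stream("input-topic", Consumed.with(Serdes.String(), Serdes.Long()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .aggregate(
                    () -> 0L,                                          // initial state per key
                    (key, newValue, previous) -> previous + newValue,  // merge new value into previous state
                    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("agg-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.Long()));

            // Publish aggregated results; the output topic is configured with cleanup.policy=compact.
            aggregated.toStream().to("aggregated-output", Produced.with(Serdes.String(), Serdes.Long()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-pipeline");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

            new KafkaStreams(builder.build(), props).start();
        }
    }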

Is KTable scalable with billions of unique keys?


Solution

  • Sounds possible. Kafka Streams uses RocksDB as its default storage engine, which allows state to spill to disk, so a properly scaled-out app can hold huge state. One main consideration is how many shards you need for good performance -- besides the actual storage requirement, the input data rate also needs to be taken into account.

    Also note that, because RocksDB does spill to disk, if an instance goes down and you restart it on the same machine, it is not necessary to reload the state from the Kafka changelog topic, as the local state will still be there. (For Kubernetes deployments, using StatefulSets helps with this case.) In general, if you have large state and want to avoid state migration (i.e., trade some "partial/temporary unavailability" for a more "stable" deployment), you should consider using static group membership; see the config sketch at the end of this answer.

    For a size estimation, note that the number of input topic partitions determines the maximum number of instances you can use to scale out your application. Thus, you need to configure your input topic with enough partitions. For client-side storage estimation, check out the capacity planning docs: https://docs.confluent.io/current/streams/sizing.html
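    As a rough, purely illustrative calculation: with ~5 TB of aggregated state spread over, say, 64 input partitions, each shard holds about 5 TB / 64 ≈ 80 GB of RocksDB state. Running 16 instances (4 partitions each) would then require roughly 320 GB of local disk per instance, plus headroom for RocksDB compaction, and the changelog topic adds another ~5 TB (times the replication factor) on the broker side.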
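    For the static group membership and local-state reuse mentioned above, a minimal config sketch could look like the following (the application id, paths, and instance id are placeholders; in a StatefulSet you would typically derive the instance id from the pod name):

        import java.util.Properties;

        import org.apache.kafka.clients.consumer.ConsumerConfig;
        import org.apache.kafka.streams.StreamsConfig;

        public class StreamsProps {
            static Properties streamsConfig(String instanceId) {
                Properties props = new Properties();
                props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-pipeline");
                props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

                // Keep RocksDB state on a persistent volume so a restarted instance can reuse
                // its local state instead of replaying the changelog topic.
                props.put(StreamsConfig.STATE_DIR_CONFIG, "/persistent-volume/kafka-streams");

                // Static group membership: a stable id per instance avoids a rebalance (and thus
                // state migration) when the instance bounces within the session timeout.
                props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), instanceId);
                props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 300000);

                // Optional: standby replicas keep a hot copy of the state on another instance.
                props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
                return props;
            }
        }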