apache-kafka, spark-streaming, apache-kafka-streams, event-driven

KTable initialization and persistence


This is more of an architectural question. I'm learning about Event-Driven Architecture and Streaming Systems with Apache Kafka. I've learned about Event Sourcing and CQRS and have some basic questions regarding implementation.

For example, consider a streaming application where we are monitoring vehicular events of drivers registered in our system. These events will be coming in as a KStream. The drivers registered in the system will be in a KTable, and we need to join the events and drivers to derive some output.

Assume that a new driver is registered via a microservice, which writes the record to a Cassandra table; the change is then propagated to the KTable's backing topic by Change Data Capture (CDC).
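To make the intended topology concrete, here is a minimal, Kafka-free Python sketch of a KStream-KTable join: a drivers table keyed by driver ID (fed by CDC-style upserts) and a stream of vehicle events enriched against it. All record shapes and driver/event data here are illustrative assumptions, not part of any real system.

```python
# Minimal simulation of a KStream-KTable join (no Kafka involved).
# The "table" holds the latest value per key; the "stream" is an ordered
# sequence of events enriched by looking up their key in the table.
# All driver/event data below is hypothetical.

drivers = {}  # KTable analogue: driver_id -> latest driver record

def upsert_driver(driver_id, record):
    """CDC-style update: the latest value per key wins."""
    drivers[driver_id] = record

def join_events(events):
    """Enrich each event with its driver, like stream.join(table)."""
    out = []
    for event in events:
        driver = drivers.get(event["driver_id"])
        if driver is not None:  # inner-join semantics: drop unmatched events
            out.append({**event, "driver_name": driver["name"]})
    return out

upsert_driver("d1", {"name": "Alice"})
upsert_driver("d2", {"name": "Bob"})

events = [
    {"driver_id": "d1", "speed": 80},
    {"driver_id": "d3", "speed": 50},  # unknown driver, dropped by the join
]
enriched = join_events(events)
print(enriched)  # [{'driver_id': 'd1', 'speed': 80, 'driver_name': 'Alice'}]
```

The questions below are about what guarantees the real Kafka-backed versions of `drivers` (the KTable) and this join have.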

  1. Since Kafka topics have a retention period (TTL) associated with them, how do we make sure that the driver records are not dropped?

  2. I understand that Kafka Streams has persistent state stores that can maintain the required state, but can I depend on one the way I would depend on a Cassandra table? Are there size considerations?

  3. If the whole application, including all Kafka brokers and consumer nodes, is terminated, can it be restarted without losing the driver records in the KTable?

  4. If the streaming application runs on Kubernetes, how would I maintain the persistent disk volumes of each container and correctly reattach them as containers come and go?

  5. Would it be preferable to join the event stream with the driver table in Cassandra using Spark Streaming or Flink? Can Spark and Flink still maintain data locality, given that their streaming consumers will be distributed by Kafka partition while the Cassandra data is distributed by its own partitioning scheme?

EDIT: I realized that Spark and Flink would pull data from Cassandra onto the respective nodes depending on which keys they hold. Kafka Streams has the advantage that the KStream and KTable to be joined are already data-local.


Solution

  • KTables don't have a TTL because they are built from compacted topics: log compaction retains at least the latest record per key indefinitely, so driver records are never dropped due to retention.

    Yes, you need to maintain the state directories for persistent Kafka Streams state stores (RocksDB on disk by default). Because the stores are on disk, no records are lost across broker or application restarts unless you actively clear the state directories on the application hosts; even then, a store can be rebuilt by replaying its changelog topic. On Kubernetes, the usual approach is a StatefulSet with a PersistentVolumeClaim per instance, so each restarted pod reattaches its previous state directory. Regarding size: RocksDB keeps state on disk rather than in memory, so stores can grow large, but very large stores lengthen recovery and rebuild times.

    Spark and Flink do not integrate with Kafka Streams state stores and have their own locality considerations. Flink does offer RocksDB-backed state, and both can broadcast the smaller dataset for remote joins. Otherwise, joining Kafka records by key requires both topics to have matching partition counts (co-partitioning), so that the same partitions are assigned to the same instances/executors, similar to Kafka Streams joins.
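The compaction point can be illustrated with a small Python sketch of log compaction semantics: replaying a changelog keeps the latest record per key, so a table rebuilt from a compacted topic never loses a driver that was written, while deletes are expressed as tombstones (null values), as in Kafka. The changelog contents below are hypothetical.

```python
def compact(changelog):
    """Simulate Kafka log compaction: keep the latest record per key.
    A None value is a tombstone and removes the key, as in Kafka."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # tombstone deletes the key
        else:
            table[key] = value     # latest value per key wins
    return table

# Hypothetical changelog for the drivers topic, oldest record first:
changelog = [
    ("d1", {"name": "Alice"}),
    ("d2", {"name": "Bob"}),
    ("d1", {"name": "Alice", "license": "B"}),  # update supersedes the first d1
    ("d2", None),                               # tombstone: d2 deleted
]
print(compact(changelog))
# {'d1': {'name': 'Alice', 'license': 'B'}}
```

No matter how old the original `d1` record is, rebuilding the table yields its latest value, which is why retention-based expiry is not a concern for KTable topics with `cleanup.policy=compact`.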
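The co-partitioning requirement can likewise be sketched: with a hash-based partitioner, a key maps to the same partition index in two topics only when their partition counts match. The sketch below uses `crc32` purely as a stand-in for Kafka's actual murmur2 partitioner, and the keys are illustrative.

```python
import zlib

def partition(key: str, num_partitions: int) -> int:
    """Hash-based partitioner; crc32 stands in for Kafka's murmur2."""
    return zlib.crc32(key.encode()) % num_partitions

keys = ["d1", "d2", "d3", "d4"]

# Same partition count on both topics -> identical key-to-partition
# mapping, so the instance owning partition p of the events topic also
# owns partition p of the drivers topic and can join locally.
events_partitions = {k: partition(k, 6) for k in keys}
drivers_partitions = {k: partition(k, 6) for k in keys}
assert events_partitions == drivers_partitions

# With mismatched counts (e.g. 6 vs 4) the mappings generally diverge,
# breaking co-partitioning, so a join would first need a repartition step.
print({k: (partition(k, 6), partition(k, 4)) for k in keys})
```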