Search code examples
apache-kafkaarchitecturepersistenceapache-kafka-streams

When Kafka Streams GlobalKTable is a good choice as a data store in microservices world?


I'm new in Kafka Streams world. I'm wondering when to use Kafka Streams GlobalKTable (with compacted topic under the hood) instead of regular database for persisting data. And what are advantages and disadvantages of both solution. I guess both ensure data persistence on the same level.

Let's say there is an simple e-commerce app having users registering and updating their data. And there are two microservices - first one (service-users) is responsible for registering users and the second one (service-orders) is responsible for placing orders. And now there are two options:

  1. When new user registers, service-user accepts request, save newly registered user data in it's database (SQL or noSQL, doesn't matter) and then send event to Kafka to propagate this to other services. service-orders receives such event and store necessary user data in it's own database. It's like a most common pattern (from my experience).

and now the second approach with GlobalKTable:

  1. When new user registers or update, service-user accepts request and send event with user data snapshot to Kafka. service-user and service-orders use GlobalKTable to read information about users.

When should I use which solution? Which solution is better in which cases? What are advantages and disadvantages of both approaches? Doesn't the second approach breaks the rule 'each microservice should maintain it's own data in it's own database'?

Hope I explained my considerations well and they make sense at all.


Solution

  • In general the adventages of GlobalKTable are:

    • You can do a Foreign-Key Join to GlobalKTable
    • Application has a full data set in memory, the data set is automatically loaded during application startup and all data modifications are automatically synchronized across all instance. Comparing it to the architecture with external database, you don't need to communicate (via network) with any other resource (like relational database) during messages processing, so it is obvious that processing is much faster and as a result you can process large amount of data quickly. When you'd like to achieve similar performance of processing, you need implement by your own some kind of in memory cache (like Guava) and then, you need to solve all issues connected with proper caching management - warming, refreshing, evicting.

    And the main disadvantages are:

    • Application has a full data set in memory, it is advantage but it can be very big issue, all depends on, how big is your data set, or how you model your data. Referring to your example, storing all users orders in GlobalKTable sounds like very bad idea, the data set will grow very fast, and the size of data is growing with time, so after few months/years of running application on production, the data set can has gigabytes and it will continuously grow. When we still like to store orders in GlobalKTable to efficent processing, we need to desing our data model differently. Probalby our entities (Orders, Documents etc) has some life cycle, like: new, paid, closed etc., few of them are terminating - I mean, there will be no further processing on entity with given id, (for example closed Order), so if there will be no processing, there is no need to store data in memory, we can forward it to some other storage, like Elasticsearch and remove it from GlobalKTable. We can name our data set with orders during processing hot storage and data set with terminated orders cold storage. Long story short: having only active/hot Orders in GlobalKTable could be a good idea.
    • Quering GlobalKTable is limited to iterating over all data set, sub set or getting data by record key, or key composed with timestamp
    • Processing based on state in external database is broadly used for many years, so, many developers know how to evolve and maintain that kind of applications. We cannot say the same of storing state in Kafka compacted topics.