Search code examples
resteventsdesign-patternsarchitecturemicroservices

microservices' async communication for business transaction


Problem

Inside a microservice architecture, consider a service A which receives a request from an user and persist some data in to its database.

Once persisted, service A has to notify service B. This notification action to service B can happen asynchronously(not needed immediately).

Design

A few ways to implement the above requirement could be:

Approach #1

  1. service A persist(insert + commit) data in to DB
  2. pushes a message into a queue for service B to consume
  3. return success to the user

Approach #2

  1. service A persist(insert + commit) data in to DB
  2. transaction outbox pattern with CDC and KAFKA
  3. service B is a subscriber to a topic in KAFKA and fetches the message

Questions

  1. Option 1 is very simple as compared to option 2. But, it introduces dual write problem. In my previous organization(a mid size start-up), we followed option 1(used aws sqs) widely without any issue. Was it because 99.9% availability SLA given by aws and were lucky to have never faced any major issue? Is dual write problem really that big to worry about?
  2. Is there any other way?

Solution

  • In a microservice architecture like the one you're describing, the primary goal is to ensure that Service A can reliably and efficiently notify Service B once it has persisted data. Let's break down the two approaches you've mentioned and then discuss the dual-write problem and additional alternatives.

    Approach #1: Queue System

    • Service A persists data in its database (insert + commit).
    • Service A pushes a message into a queue (like AWS SQS) for Service B to consume.
    • Service A returns success to the user.

    Pros:

    • It's straightforward to implement.
    • Service A doesn't need to know about the internals of Service B.
    • The queue can handle high loads and smooth out traffic spikes.

    Cons:

    • If Service A successfully writes to the database but fails to send the message to the queue (or if the message gets lost in the queue), Service B won't be notified.

    • Ensuring the data is in sync between the two services can be challenging.

    Approach #2: Transaction Outbox with CDC and Kafka

    • Service A persists data in its database (insert + commit).

    • Transaction outbox pattern is used, where an "outbox" table in the database is updated as part of the same transaction.

    • Change Data Capture (CDC) picks up changes in the outbox table and pushes them to Kafka.

    • Service B subscribes to a Kafka topic and fetches the message when it appears.

    Pros:

    • The dual-write problem is mitigated as both database write and message notification are part of the same transaction.
    • Scalability and Performance: Kafka can handle high throughput and provides additional features like replaying messages or scaling consumers.

    Cons:

    • Requires setting up and managing Kafka and CDC, which can be complex and might introduce new failure modes.
    • More moving parts mean more monitoring, tuning, and maintenance.

    The dual write problem is significant because it affects data consistency and system reliability. If one write succeeds (to the database) but the other fails (to the queue or outbox), the systems can become out of sync. The severity depends on how critical it is for Service B to receive all notifications without loss or delay. In systems where consistency isn't as critical, or where occasional manual intervention is acceptable, the dual-write problem might be less of a concern.

    Additional Approaches

    1. Event Sourcing: Instead of writing directly to the database, Service A writes events to a log. The database and other services become subscribers to this log, ensuring all state changes are captured as events and can be reacted upon by multiple services.
    2. Database Triggers: Use database triggers to automatically send notifications or write to a queue/table when certain conditions are met. This can ensure atomicity between data persistence and notification.
    3. Direct HTTP Call: Service A could make a direct HTTP call to Service B post-persistence. This can be made more reliable with retries, circuit breakers, and ensuring idempotency, but it doesn't offer as much decoupling as other methods.

    Approach #1 might be sufficient for systems where eventual consistency is acceptable or where the volume and criticality of the messages don't justify the added complexity of Approach #2. However, for systems requiring strong consistency and reliability, investing in more robust solutions like Approach #2 or exploring additional approaches might be necessary. Consider the trade-offs in terms of development and operational complexity, performance, and reliability to choose the right approach for your needs.