events domain-driven-design microservices cqrs distributed-system

CQRS without Event Sourcing: handle event log failure

As I do not use Event Sourcing in my CQRS Application, I introduced a simple event log which enables me to update the read-store.

This implies that a state change to my application consists of two actions:

1. Update the write-model state, e.g. with an SQL INSERT
2. Insert the event into the event log

Both write operations have to happen as one atomic operation. Unfortunately, the event log resides in another database, so I have to think about a distributed transaction.

Most CQRS samples deal with the saga pattern and they all seem to utilize event sourcing, which makes things much simpler.

My problem is a "half-finished" state change, e.g.

SQL Insert succeeds
Event Log Insert fails

I could come up with a compensating SQL operation (pseudo code):

SQLTransaction.Commit(); // if this fails, all is fine. Nothing to revert
try 
{
    EventLog.Insert(event);
}
catch(Exception ex) 
{
    // Try to undo the SQL stuff.
    CompensatingSQLTransaction().Commit(); 
    // uh-oh! The commit fails!!

    // What now? Do a Retry?
}

Are there any concepts that could help me out? I thought about the following scenario to prevent an out-of-sync read-database:

Each event has a sequence number
If the read-side replication detects an unprocessed event (e.g. receives 40, then 42), it queries the event log for event 41.
If event 41 is not available, the system stops replicating any events until someone takes a closer look.

This requires manual maintenance, but prevents the read-db from getting out-of-sync.
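
The gap-detection idea above could be sketched roughly as follows. This is only an illustration under assumptions: `ReadSideReplicator`, `fetch_event`, `apply_event` and the `seq` key are hypothetical names, and the event log lookup is modelled as a plain function that returns `None` when an event is missing.

```python
from typing import Callable, Optional


class ReplicationHalted(Exception):
    """Raised when a missing event cannot be recovered from the event log."""


class ReadSideReplicator:
    """Applies events in sequence order and stops when a gap cannot be filled."""

    def __init__(self,
                 fetch_event: Callable[[int], Optional[dict]],
                 apply_event: Callable[[dict], None],
                 last_seq: int = 0) -> None:
        self.fetch_event = fetch_event  # hypothetical lookup into the event log
        self.apply_event = apply_event  # hypothetical read-store update
        self.last_seq = last_seq        # highest sequence number applied so far
        self.halted = False

    def receive(self, event: dict) -> None:
        if self.halted:
            raise ReplicationHalted("replication is stopped pending manual review")
        seq = event["seq"]
        if seq <= self.last_seq:
            return  # duplicate delivery, already applied
        # Ask the event log for every sequence number that was skipped.
        for missing in range(self.last_seq + 1, seq):
            recovered = self.fetch_event(missing)
            if recovered is None:
                self.halted = True
                raise ReplicationHalted(f"event {missing} is not in the event log")
            self.apply_event(recovered)
            self.last_seq = missing
        self.apply_event(event)
        self.last_seq = seq
```

With `last_seq=39`, receiving 40 and then 42 makes the replicator fetch 41 from the log before applying 42. If the log has no event 41, it raises `ReplicationHalted` and refuses further events until someone resets it.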

Any real life experiences?

Solution

"Both write operations have to happen as one atomic operation."

There's a really important question to raise at this point: why? What is the cost to the business if the remote event log is not synchronized with the book of record?

If you don't need that synchronization, then a straightforward approach is to put a copy of your event log into the same database as your write model. Udi Dahan discusses this approach in Reliable Messaging Without Distributed Transactions. After the write transaction succeeds, you can then replicate the events from the SQL store to the remote event log.

This gives you a remote event log that is always consistent with some state in the past, but doesn't promise to be caught up to the present.
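
A minimal sketch of that approach, using Python's built-in `sqlite3` module as a stand-in for the SQL store. The `orders` and `outbox` tables, the `OrderPlaced` event and the `publish` callback are assumptions made for illustration; they are not taken from the question or from the referenced talk.

```python
import json
import sqlite3
from typing import Callable


def create_schema(db: sqlite3.Connection) -> None:
    # Hypothetical schema: one write-model table plus a local copy of the event log.
    db.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            id INTEGER PRIMARY KEY,
            item TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(db: sqlite3.Connection, item: str) -> int:
    # Both inserts share one local transaction: either both rows exist or neither.
    with db:  # commits on success, rolls back if an exception escapes
        cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        event = {"type": "OrderPlaced", "order_id": order_id, "item": item}
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return order_id


def relay_outbox(db: sqlite3.Connection,
                 publish: Callable[[int, dict], None]) -> int:
    # Copies unpublished events to the remote event log in sequence order.
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    sent = 0
    for seq, payload in rows:
        publish(seq, json.loads(payload))  # if this raises, the row stays unpublished
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

`relay_outbox` would run separately from the write path, for example on a timer. If `publish` succeeds but marking the row fails, the event is sent again on the next run, so the remote log should ignore sequence numbers it has already stored.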

This is usually good enough; after all, the event log itself is a snapshot of the past, and the book of record may be changing while the representation of the event log is copied to the consumer.

But if that won't do, your choices are to find a distributed transaction engine that provides an acceptable compromise, or to use sagas to undo your changes to the local store if the remote write fails.

Yan Cui's discussion of the saga pattern in AWS, which in turn references Caitie McCaffrey's 2015 talk on sagas in distributed systems, raises this point:

"Because the compensating actions can also fail so we need to be able to retry them until success, which means they have to be idempotent."

"In practice, there should be a reasonable upper limit on the no. of retries before you alert for human intervention."

So yes - you retry.
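
As a rough sketch, a bounded retry around an idempotent compensating action might look like the following. `run_with_retries`, `CompensationFailed` and the exponential backoff values are illustrative assumptions, not part of any particular saga library.

```python
import time
from typing import Callable


class CompensationFailed(Exception):
    """Raised once the retry limit is reached; the signal to alert a human."""


def run_with_retries(action: Callable[[], None],
                     max_attempts: int = 5,
                     base_delay: float = 0.1,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Runs an idempotent action until it succeeds; returns the attempt count."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except Exception as exc:
            if attempt == max_attempts:
                # Upper limit reached: stop retrying and escalate.
                raise CompensationFailed(
                    f"gave up after {attempt} attempts"
                ) from exc
            # Exponential backoff between attempts: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The action passed in has to be safe to repeat. Deleting the inserted row by its primary key is safe in that sense, because running it a second time changes nothing, whereas decrementing a counter is not.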