domain-driven-design cqrs event-sourcing event-driven

In event sourcing, how do you deal with a failure in production?


Suppose I want to build an e-commerce system using an event-driven architecture with event sourcing. Say a user wants to buy a product priced at $1, but I miscalculate and the price becomes $2. The user now loses $2 from his wallet instead of $1. With CRUD, I could simply fix the bug, connect to the database host, and correct the user's wallet (and send him an apology). But in event sourcing, as far as I know, we should not edit or delete events (only append), since the event log is the single source of truth. So how should I deal with this kind of failure? One thing I can think of is to create an admin page that can publish any kind of event, and fix it like this:

AccountCreatedEvent { userId: 1, balance: 3 }
ProductPurchasedEvent { userId: 1, price: 2 } // miscalculated: the price should be $1
DepositMoneyEvent { userId: 1, amount: 1 } // manually fixed by admin

I know it seems weird, but if I really have to fix the bug and also make the data valid again, how do we achieve that in event sourcing?


Solution

  • A common answer is that you look to the domain. For example, what business processes exist to mitigate the contingency that a customer is overcharged? Does our business have a process for refunds?

    The "right" answer is to implement that process. The resulting history will contain an event that overcharges the customer and, later, another event that refunds the overcharged amount to make things right.

    This is, of course, exactly your "fix it and apologize" approach; the main difference is that we treat the error correction as part of the system, rather than as something we improvise.
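As a rough illustration of the refund process described above, here is a minimal Python sketch. The event names mirror the ones in the question, but `PurchaseRefunded`, its `reason` field, and the `balance_of` helper are hypothetical names chosen for this example, and amounts are assumed to be whole dollars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountCreated:
    user_id: int
    balance: int


@dataclass(frozen=True)
class ProductPurchased:
    user_id: int
    price: int


@dataclass(frozen=True)
class PurchaseRefunded:
    # Hypothetical domain event: the refund is a first-class business fact,
    # not a generic "admin deposit".
    user_id: int
    amount: int
    reason: str


def balance_of(events):
    """Fold the event stream into the current wallet balance."""
    balance = 0
    for event in events:
        if isinstance(event, AccountCreated):
            balance = event.balance
        elif isinstance(event, ProductPurchased):
            balance -= event.price
        elif isinstance(event, PurchaseRefunded):
            balance += event.amount
    return balance


stream = [
    AccountCreated(user_id=1, balance=3),
    ProductPurchased(user_id=1, price=2),  # the bug: should have been 1
]

# The fix is appended; the faulty event stays in the history untouched.
stream.append(
    PurchaseRefunded(user_id=1, amount=1, reason="overcharged by pricing bug")
)

print(balance_of(stream))  # 2
```

Compared with a generic deposit event published from an admin page, a dedicated refund event keeps the reason for the correction visible to anything that later reads the stream.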

    Memories, Guesses, and Apologies (Pat Helland, 2007) is a good starting point.

    Another example would be a fault where the system did the right thing, but wrote down the wrong information. A common pattern here is to handle the mistake the way it would be handled in accounting: one event is produced to "reverse" the transcription error, and a new event is created to record the correct information.

    Again, notice that this correction process is part of the domain of accounting. Our job here is to faithfully re-create the error correcting processes that already exist.
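The reverse-then-correct pattern can be sketched as follows. This is only an illustration under assumptions: `EntryRecorded`, its `reverses` field, and `reverse_and_correct` are hypothetical names, and a real ledger would carry far more detail (accounts, timestamps, who authorized the correction).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntryRecorded:
    entry_id: str
    amount: int
    # Set only on reversing entries: the id of the entry being cancelled out.
    reverses: Optional[str] = None


def reverse_and_correct(log, wrong_id, correct_amount):
    """Return a new log with a reversing entry and a corrected entry appended.

    The original (wrong) entry is never edited or removed.
    """
    wrong = next(e for e in log if e.entry_id == wrong_id)
    return log + [
        EntryRecorded(f"{wrong_id}-reversal", -wrong.amount, reverses=wrong_id),
        EntryRecorded(f"{wrong_id}-corrected", correct_amount),
    ]


log = [EntryRecorded("e1", 2)]           # written down as 2, should have been 1
corrected = reverse_and_correct(log, "e1", 1)

print(sum(e.amount for e in corrected))  # 1
print(len(corrected))                    # 3
```

The net amount ends up correct, and the log still shows what was originally recorded, that it was reversed, and what replaced it.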

    The basic pattern is the same throughout; we add more events to the system to correct the mistakes (and more events to correct mistakes in the corrections; it's turtles all the way down).

    When you've got processes that are triggered by the events that appear in the stream, you may end up playing "chase".

    We overcharged the customer, but this meant that the customer was automatically enrolled in the VIP discount program. When we fix the error, do we also need to remove their VIP status? What happens to the discounted purchases they made before the error was discovered?
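To make the "chase" concrete, here is a small Python sketch of a policy that reacts to events in the stream. Everything in it is an assumption for illustration: the `VIP_THRESHOLD` value, the event names, and in particular the choice to emit a `VipStatusReviewRequested` event after a refund instead of revoking VIP status automatically. Which of those is right is a business decision, not a technical one.

```python
from dataclasses import dataclass

VIP_THRESHOLD = 2  # hypothetical: net spend needed for automatic enrollment


@dataclass(frozen=True)
class ProductPurchased:
    user_id: int
    price: int


@dataclass(frozen=True)
class PurchaseRefunded:
    user_id: int
    amount: int


@dataclass(frozen=True)
class VipEnrolled:
    user_id: int


@dataclass(frozen=True)
class VipStatusReviewRequested:
    user_id: int


def net_spend(events, user_id):
    """Purchases minus refunds for one user."""
    total = 0
    for e in events:
        if isinstance(e, ProductPurchased) and e.user_id == user_id:
            total += e.price
        elif isinstance(e, PurchaseRefunded) and e.user_id == user_id:
            total -= e.amount
    return total


def is_vip(events, user_id):
    return any(isinstance(e, VipEnrolled) and e.user_id == user_id for e in events)


def react(events, new_event):
    """Policy: follow-up events triggered by new_event (already in events)."""
    uid = new_event.user_id
    spend = net_spend(events, uid)
    if isinstance(new_event, ProductPurchased):
        if spend >= VIP_THRESHOLD and not is_vip(events, uid):
            return [VipEnrolled(uid)]
    elif isinstance(new_event, PurchaseRefunded):
        # The correction itself triggers more work: flag it for a decision.
        if spend < VIP_THRESHOLD and is_vip(events, uid):
            return [VipStatusReviewRequested(uid)]
    return []


def append(events, event):
    """Append an event, then append whatever the policy emits in response."""
    events.append(event)
    for follow_up in react(events, event):
        append(events, follow_up)


events = []
append(events, ProductPurchased(user_id=1, price=2))  # overcharge enrolls VIP
append(events, PurchaseRefunded(user_id=1, amount=1))  # fix triggers a review

print([type(e).__name__ for e in events])
# ['ProductPurchased', 'VipEnrolled', 'PurchaseRefunded', 'VipStatusReviewRequested']
```

The sketch does not answer what should happen to discounted purchases made in the meantime; it only shows that the correcting event is itself an input to downstream processes, which is where the chase comes from.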