I often use Kubernetes for my deployments, and I don't see how building a read model could work when we want multiple read model producers. If I spin up a new service that needs to rebuild its read model, I would subscribe to the event store and replay all events from the beginning; once the service has caught up, it would just listen for incoming new events. Everything looks fine with a single instance of this service, but if we have, for example, two instances, both would receive the events and try to apply them twice. After some searching, the most common solution I found is to use only a single subscriber/instance per read model database. From my perspective, this approach has a single point of failure: if the subscriber fails for some reason but doesn't crash immediately, Kubernetes will not spin up a new instance of the service. How would I handle such a case?
Currently I see it like this: CommandService (multiple instances) => EventStore => ReadModelProducerService (single instance) => ReadModel <=> QueryService (multiple instances). If this single instance of ReadModelProducerService that generates the read model fails, the app is basically down.
There are at least three issues with concurrent subscribers executing the same projection code. The first is the one you already described and the most obvious: both instances receive the same events and try to apply them twice, which may lead to undesired consequences like record locks and timeouts.
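To make the double-application issue concrete, here is a minimal sketch (not from Eventuous or any particular framework) of a projection that guards against re-applied events by remembering the last processed global position in a hypothetical `checkpoint` table. The table, column names, and Postgres-style placeholders are assumptions for illustration only.

```go
package projection

import (
	"context"
	"database/sql"
)

// Event is a simplified representation of an event read from the store.
// GlobalPosition is assumed to be a monotonically increasing number
// assigned by the event store.
type Event struct {
	GlobalPosition int64
	Type           string
	Data           []byte
}

// Apply projects one event into the read model inside a single transaction.
// The checkpoint table (hypothetical) stores the last applied position per
// projection, so an event delivered twice - e.g. by two competing
// subscribers - is skipped instead of being applied a second time.
func Apply(ctx context.Context, db *sql.DB, projection string, e Event) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var last int64
	err = tx.QueryRowContext(ctx,
		`SELECT position FROM checkpoint WHERE projection = $1 FOR UPDATE`,
		projection).Scan(&last)
	if err != nil && err != sql.ErrNoRows {
		return err
	}
	if e.GlobalPosition <= last {
		// Already applied (or being applied by another instance): skip.
		return tx.Commit()
	}

	// ... update the read model tables based on e.Type and e.Data ...

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO checkpoint (projection, position) VALUES ($1, $2)
		 ON CONFLICT (projection) DO UPDATE SET position = EXCLUDED.position`,
		projection, e.GlobalPosition); err != nil {
		return err
	}
	return tx.Commit()
}
```

Note that this only makes duplicate delivery harmless; it does not by itself resolve the other problems of concurrent writers, which is why a single subscriber per read model remains the common recommendation you found.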
I wouldn't be that concerned about a "single point of failure", as there are many other "single points": the event store itself, Kubernetes, the projection sink - all of these components can fail.
If you trust Kubernetes more than your own code, you can avoid the situation where the subscription has crashed but stays alive in "zombie mode". The problem is well known, and so is the solution. You need to add a histogram metric to the subscription to measure the processing rate. Another useful metric is the subscription gap, which will grow if the subscription is slow or has stopped. In many cases, you can detect that the subscription has dropped because it gives your application a signal (it can't connect to the database, etc.), which can be used as a health check, forcing Kubernetes to restart the pod. I wrote about these things in the Eventuous docs.
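As an illustration of the health-check idea (a minimal sketch, not the Eventuous implementation), the subscription below records the time of the last successfully processed event and exposes a /healthz endpoint; if the subscription stalls, the endpoint starts failing, and a Kubernetes liveness probe pointed at it will restart the pod. The endpoint path, port, and staleness threshold are all assumptions.

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastProcessed holds the Unix time of the last event the subscription
// successfully projected. It is updated from the subscription loop.
var lastProcessed atomic.Int64

// markProcessed should be called by the subscription after each event
// (or each successful checkpoint) is handled.
func markProcessed() {
	lastProcessed.Store(time.Now().Unix())
}

// healthz fails when the subscription has not made progress for longer
// than the threshold. The 2-minute value is an arbitrary assumption;
// tune it to your event rate, or combine it with the measured
// subscription gap so an idle-but-healthy system is not restarted.
func healthz(w http.ResponseWriter, _ *http.Request) {
	const stale = 2 * time.Minute
	last := time.Unix(lastProcessed.Load(), 0)
	if time.Since(last) > stale {
		http.Error(w, "subscription stalled", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	markProcessed() // assume healthy at startup
	http.HandleFunc("/healthz", healthz)
	// Point the Kubernetes livenessProbe at :8080/healthz so a zombie
	// subscription gets the pod restarted instead of silently stalling.
	_ = http.ListenAndServe(":8080", nil)
}
```

A failing liveness probe makes Kubernetes restart the container, which is exactly the behaviour you're missing when the subscriber "fails but doesn't crash".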