Search code examples
elasticsearchcqrslagom

CQRS (Lagom) elasticsearch read-side


I've read that ElasticSearch isn't the most reliable in terms of durability, but I would like to use it to store data on the read-side for optimal searching.
If we store events (write-side) in a cassandra database, that means that data is never really lost.

I don't really understand what is meant with 'data durability'.
If we use ES on the read-side, does that mean that some data may not be properly imported? Does it mean that one day data may randomly be lost, or the risk that all data may one day just have disappeared?

The use case is a Twitter-like geolocation based app.
How reliable is it in the end to use ES exclusively on the read-side, without needing a more reliable datastore (write-side) to store the data?
Depending on what is meant with this "durability", I wonder what measures should be taken to replay events and keep ES consistent at all times.

Thanks


Solution

  • I don't have a huge amount of experience running ES in production, but essentially, ensuring that when you persist data, it stays persisted, especially in a distributed system, is hard. There are many, many edge cases that are very hard to get right, and it takes time for a database to mature and sort those edge cases out. A less durable database is one that probably hasn't ironed all these issues out.

    Of course, ElasticSearch is popular open source database with a thriving community maintaining it, so there's likely no well defined cases where "your data will be lost in this circumstance", rather, there's likely cases that either haven't been come across yet, or when they have been come across by users in the wild, the users that came across them didn't care enough to debug it because they were only using ES as a secondary data store and were able to rebuild it from their primary data store. Whenever a case is identified that ES loses data under well understood circumstances, the maintainers of ES would be quick to fix that.

    The most typical use cases for ES are as a secondary database store, and in such a use case, durability isn't as important because the data store can be rebuilt from the primary. Accordingly, you'll find durability isn't as high a priority to the maintainers of ES because their users aren't asking for it - that's not say it's not a high priority, just relative to other databases, it's not as high.

    So, if you use ES, you've got a higher chance of encountering bugs where you'll lose data, than with other databases that are either more mature or put more of a focus on durability in their development.

    As to whether you should regularly drop your ES database and replay the events, it really depends on your use case and how important it is for your ES database to be consistent. A lot of the edge cases around ES's durability probably result in major corruptions with significant data loss - ie, you'll know if it happens, so there's no need to drop and replay regularly in that case. Another thing to consider is that because of the way CQRS read sides work, you'll only have a limited number of writers to your ES store, and you can easily control that concurrency. What this means is that a spike in load won't result in a spike in concurrent writers, what will happen is that your ES store might temporarily lag behind in consistency from your primary store. Due to this, you're probably less likely to encounter the edge cases that might trigger ES to lose data.

    So, you're probably fine not bothering dropping and rebuilding unless something catastrophic happens, unless the consequences of silently losing small amounts of data in a way that you won't notice are so high that the incredibly small chance that that might happen is unacceptable.