I'm working on a solution in which we have two databases used for the following purposes:
Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc. Two typical use cases are:
However, as I feared, as the number of people using the app increased we ran into race conditions: if user #1 changed the name of a product to "Banana" and user #2 changed it to "Apple" at the same time, sometimes the last record saved in Elasticsearch would be "Banana" while in Postgres "Apple" would be the last value, creating a serious inconsistency between the databases.
So I've ventured into reading about optimistic/pessimistic locking in order to solve my problem, but so far all the articles I find cover the case where you only use one relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). Our combined storage solution of Elasticsearch + Postgres requires more "ballet" than that.
What techniques/options are available to me to solve my kind of problem?
Well, I may attract some criticism, but let me explain it the way I understand it. This problem/concern is more of an architectural matter than a design/code one.
From the application layer
For immediate consistency between the two databases, the only way you can achieve it is to do polyglot persistence in a transactional way, so that either the same data gets updated in both Postgres and Elasticsearch or in neither. I wouldn't recommend this, purely because it would put a lot of pressure on the application and you would find it very difficult to scale and maintain. A minimal sketch of what this looks like follows the diagram below.
So basically GUI --> Application Layer --> Postgres/Elasticsearch
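To make the trade-off concrete, here is a minimal dual-write sketch in Python, assuming psycopg2 and the official `elasticsearch` client (8.x). The connection strings, the `products` table/index, and the update shape are illustrative assumptions, not a definitive implementation:

```python
# Dual-write sketch: update Postgres and Elasticsearch together, rolling the
# Postgres transaction back if the Elasticsearch write fails.
# psycopg2 + official elasticsearch client; names and DSNs are illustrative.
import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=shop user=app")
es = Elasticsearch("http://localhost:9200")

def update_product_name(product_id, new_name):
    cur = pg.cursor()
    try:
        # Write to Postgres first, but hold the commit.
        cur.execute("UPDATE products SET name = %s WHERE id = %s",
                    (new_name, product_id))
        # Index into Elasticsearch; if this raises, Postgres rolls back below.
        es.index(index="products", id=product_id, document={"name": new_name})
        pg.commit()
    except Exception:
        pg.rollback()  # undo the Postgres write so the stores stay in step
        raise
    finally:
        cur.close()
```

Note that even this "transactional" shape has a hole: if the final commit fails after the Elasticsearch write has already succeeded, the stores diverge anyway, which is part of why this approach is so hard to maintain.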
Queue/Real Time Streaming mechanism
You need a messaging queue, so that in this event-based approach the updates go onto the queue. A sketch of the flow follows the diagram below.
GUI --> Application Layer --> Postgres--> Queue --> Elasticsearch
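As a sketch of that flow, assuming Kafka via the kafka-python client (the `product-updates` topic and the payload shape are made up for illustration): the application publishes an event only after the Postgres commit succeeds, and a separate consumer process applies it to Elasticsearch:

```python
# Event-based sketch with Kafka (kafka-python): the application publishes an
# update event after committing to Postgres; a separate consumer indexes it
# into Elasticsearch. Topic name and payload shape are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer
from elasticsearch import Elasticsearch

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_product_update(product_id, name):
    # Called by the application layer after the Postgres commit succeeds.
    producer.send("product-updates", {"id": product_id, "name": name})
    producer.flush()

def run_es_consumer():
    # Runs as a separate process: drains the queue and applies each update
    # to Elasticsearch in the order it was published.
    es = Elasticsearch("http://localhost:9200")
    consumer = KafkaConsumer(
        "product-updates",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        event = msg.value
        es.index(index="products", id=event["id"],
                 document={"name": event["name"]})
```

Because the queue persists events, a temporarily unavailable Elasticsearch just delays indexing instead of losing the update.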
Have a separate application, normally called an indexer. The purpose of this tool is to pick up updates from Postgres and push them into Elasticsearch.

What you can have in the indexer is multiple configurations, one per source, each of which would either:

- run a `select *` and index everything into Elasticsearch (a full crawl), or
- pull only the rows updated in Postgres and push them into Elasticsearch (an incremental crawl). The incremental query can rely on a `status` column which is `0` for a document which got recently updated, or on a timestamp, to pull records which got updated in the last 30 secs/1 min, depending on your needs. Afterwards, a post-incremental query needs to change the status to `1` (success) or `-1` (failure) so that in the next crawl the same document doesn't get picked up; see the sketch after the diagram below.

Basically we would have GUI --> Application Layer --> Postgres --> Indexer --> Elasticsearch
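A minimal sketch of such an incremental crawl in Python, again assuming psycopg2 and the official `elasticsearch` client (8.x); the `status` semantics (0 = pending, 1 = indexed, -1 = failed), table/index names, and connection strings are illustrative assumptions:

```python
# Incremental-crawl sketch for the indexer: pick up rows whose status is 0
# (recently updated), push them into Elasticsearch, then mark each row
# 1 (success) or -1 (failure) so the next crawl skips it.
# Assumes a `status` column on `products`; names are illustrative.
import psycopg2
from elasticsearch import Elasticsearch

pg = psycopg2.connect("dbname=shop user=indexer")
es = Elasticsearch("http://localhost:9200")

def incremental_crawl():
    cur = pg.cursor()
    # Incremental query: only rows flagged as updated since the last crawl.
    cur.execute("SELECT id, name, price, description FROM products "
                "WHERE status = 0")
    for row_id, name, price, description in cur.fetchall():
        try:
            es.index(
                index="products",
                id=row_id,
                document={"name": name, "price": price,
                          "description": description},
            )
            new_status = 1    # success
        except Exception:
            new_status = -1   # failure; can be retried or inspected later
        # Post-incremental query: record the outcome so this row is not
        # picked up again by the next crawl.
        cur.execute("UPDATE products SET status = %s WHERE id = %s",
                    (new_status, row_id))
        pg.commit()
    cur.close()
```

Run on a schedule (say, every 30 seconds or every minute) this gives you the incremental crawl; dropping the `WHERE status = 0` filter turns the same loop into a full crawl.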
I do not think it would be wise to aim for a fail-proof design; rather, we should have a system that can recover in the quickest possible time when it comes to providing consistency between two different data sources.
Having the systems decoupled helps greatly with scaling and with tracking down issues related to data correctness/quality, and at the same time it helps you deal with frequent updates as well as the growth rate of the data and of the updates along with it.
Also, I recommend one more link that can help.
Hope it helps!