Tags: postgresql, elasticsearch, concurrency, optimistic-locking, pessimistic-locking

How to implement an optimistic (or pessimistic) locking when using two databases that need to be in sync?


I'm working on a solution in which we have two databases that are used with the following purposes:

  • An Elasticsearch used for search purposes
  • A Postgres database that acts as a source of truth for the data

Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc... And two typical use cases are:

  • Retrieve products by name: a search is performed with Elasticsearch, and the IDs it returns are then used in a second query against Postgres to fetch the authoritative data (so we get fast searches on big tables while still returning trustworthy data)
  • Update product fields: we allow users to update any product information (kind of a collaborative wiki). We first store the data in Postgres, and then in Elasticsearch.

However, as I feared, once the number of people using the app increased we ran into race conditions: if user #1 changed a product's name to "Banana" while user #2 simultaneously changed it to "Apple", Elasticsearch would sometimes end up with "Banana" as the last record saved while Postgres held "Apple", creating a serious inconsistency between the databases.

So I've ventured into reading about optimistic/pessimistic locking to solve my problem, but so far every article I find covers the case of a single relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). But our combined storage solution of ES + Postgres requires more "ballet" than that.
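For reference, the single-database optimistic-locking pattern those articles describe boils down to a version column plus a compare-and-swap UPDATE. A minimal sketch (an in-memory SQLite database is used purely to keep it runnable; the SQL is the same in Postgres, and the `products` schema is made up):

```python
import sqlite3

# In-memory SQLite stands in for Postgres; the compare-and-swap UPDATE
# pattern below is identical in both. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
conn.execute("INSERT INTO products VALUES (1, 'Pear', 0)")

def update_name(conn, product_id, new_name, expected_version):
    """Optimistic lock: the UPDATE only succeeds if the version the client
    originally read is still the current one; otherwise someone else won the race."""
    cur = conn.execute(
        "UPDATE products SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, product_id, expected_version),
    )
    return cur.rowcount == 1  # False -> stale version, caller must re-read and retry

# Both users read version 0, then race to write:
assert update_name(conn, 1, "Banana", 0) is True   # user #1 wins
assert update_name(conn, 1, "Apple", 0) is False   # user #2 is rejected, not silently lost
```

The loser gets an explicit failure instead of a silent overwrite, which is exactly what the dual-store setup lacks.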

What techniques/options are available to me to solve my kind of problem?


Solution

  • Well, I may attract some criticism, but let me explain it the way I understand it: this problem/concern is more of an architectural matter than a design/code one.

    Immediate consistency vs. eventual consistency

    From the application layer

    For immediate consistency between the two databases, the only way to achieve it is polyglot persistence done transactionally, so that either the same data gets updated in both Postgres and Elasticsearch or in neither of them. I wouldn't recommend this, purely because it would put a lot of pressure on the application and you would find it very difficult to scale/maintain.

    So basically GUI --> Application Layer --> Postgres/Elasticsearch
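A sketch of what that all-or-nothing dual write would look like; all three callables are hypothetical stand-ins for the real Postgres transaction and Elasticsearch index call (here simulated with a plain dict):

```python
def write_both(save_to_postgres, index_in_elasticsearch, rollback_postgres):
    """All-or-nothing dual write: if Elasticsearch indexing fails, roll the
    Postgres transaction back so the two stores never diverge.
    All three arguments are hypothetical callables."""
    save_to_postgres()            # inside an open transaction, not yet committed
    try:
        index_in_elasticsearch()  # may raise (timeout, mapping error, ...)
    except Exception:
        rollback_postgres()       # undo the Postgres write; stores stay in sync
        raise
    # only now would you COMMIT the Postgres transaction

# Happy path: both stores end up with the same value.
state = {"pg": None, "es": None}
write_both(lambda: state.update(pg="Banana"),
           lambda: state.update(es="Banana"),
           lambda: state.update(pg=None))
assert state == {"pg": "Banana", "es": "Banana"}

# Failure path: Elasticsearch is down, so the Postgres write is undone
# (a real implementation would issue ROLLBACK; here we restore the old value).
def es_down():
    raise RuntimeError("ES unreachable")

try:
    write_both(lambda: state.update(pg="Apple"), es_down,
               lambda: state.update(pg="Banana"))
except RuntimeError:
    pass
assert state["pg"] == "Banana"  # no divergence between the stores
```

Holding the Postgres transaction open across a network call to Elasticsearch is precisely the pressure on the application mentioned above.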

    Queue/Real Time Streaming mechanism

    You need a messaging queue, so that updates flow to the queue in an event-based approach.

    GUI --> Application Layer --> Postgres --> Queue --> Elasticsearch
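A toy sketch of this flow, with `queue.Queue` standing in for a real broker such as Kafka or RabbitMQ (all names here are illustrative, and plain dicts stand in for the two stores):

```python
import queue

# queue.Queue stands in for a real message broker; the point is that the
# application only writes to Postgres, and Elasticsearch is updated by a
# consumer draining the queue in order.
events = queue.Queue()

def handle_update(product_id, name, pg_store):
    pg_store[product_id] = name                   # 1. source of truth first
    events.put({"id": product_id, "name": name})  # 2. emit a change event

def es_consumer(es_index):
    while not events.empty():
        evt = events.get()
        es_index[evt["id"]] = evt["name"]         # 3. apply to Elasticsearch

pg, es = {}, {}
handle_update(1, "Banana", pg)
handle_update(1, "Apple", pg)
es_consumer(es)
assert pg[1] == es[1] == "Apple"  # both stores converge on the last write
```

Because events are consumed in the order they were produced, concurrent updates can no longer leave the two stores with different "last" values; Elasticsearch simply lags a little behind.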
    

    Eventual consistency but not immediate consistency

    Have a separate application; let's call it an indexer. The purpose of this tool is to pull updates from Postgres and push them into Elasticsearch.

    In the indexer you can have one configuration per source, which would support:

    • An option to do select * and index everything into Elasticsearch, i.e. a full crawl
      • This would be used when you want to delete/reindex the entire data set into Elasticsearch
    • The ability to detect only the updated rows in Postgres and push just those into Elasticsearch, i.e. an incremental crawl
      • For this you would need a select query with a where clause based on a status column on your Postgres rows, e.g. pull records with status 0 for documents that were recently updated, or based on a timestamp to pull records updated in the last 30 secs/1 min, depending on your needs. The incremental query
      • Once you perform the incremental crawl, if you implement it using a status column, you need to change the status to 1 (success) or -1 (failure) so that the same document doesn't get picked up in the next crawl. The post-incremental query
      • Basically, schedule jobs to run the above queries as part of the indexing operations.
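The incremental-crawl loop above could be sketched like this (hypothetical schema: status 0 = pending, 1 = indexed, -1 = failed; an in-memory SQLite database and a plain dict stand in for Postgres and Elasticsearch):

```python
import sqlite3

# Minimal indexer sketch. In production this function would be run by a
# scheduled job every N seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, status INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?, 0)",
                 [(1, "Banana"), (2, "Apple")])

def incremental_crawl(conn, es_index):
    """One indexer run: pull pending rows (status 0), push them to
    Elasticsearch (a dict here), then mark them so the next run skips them."""
    pending = conn.execute(
        "SELECT id, name FROM products WHERE status = 0").fetchall()
    for pid, name in pending:
        try:
            es_index[pid] = name   # push to Elasticsearch
            new_status = 1         # success
        except Exception:
            new_status = -1        # failure: retry/report in a later run
        conn.execute("UPDATE products SET status = ? WHERE id = ?",
                     (new_status, pid))
    return len(pending)

es = {}
assert incremental_crawl(conn, es) == 2  # two pending rows indexed
assert incremental_crawl(conn, es) == 0  # nothing left to pick up
```

Updating the status inside the same run is the post-incremental query from the list above; rows marked -1 can be retried or reported separately.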

    Basically we would have GUI --> Application Layer --> Postgres --> Indexer --> Elasticsearch

    Summary

    I do not think it would be wise to aim for a fail-proof system; rather, we should have a system that can recover as quickly as possible when it comes to keeping two different data sources consistent.

    Having the systems decoupled helps greatly with scaling and with diagnosing data correctness/quality issues, and at the same time helps you deal with frequent updates as well as the growth rate of the data and the updates that come with it.


    Hope it helps!