elasticsearch apache-kafka lucene elasticsearch-7

Duplicated Elasticsearch documents

We use a Spring Boot application to insert and update Elasticsearch documents. Our data provider sends us data via Kafka. The app processes each event and searches for a matching record; it inserts the record if none exists, or updates it if the received record differs from the saved one. There should never be duplicate records in Elasticsearch.

The app inserts and updates documents with the IMMEDIATE refresh policy.
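A find-then-insert flow like the one described depends on the search seeing every earlier write. One common way to make it more tolerant is to derive the document `_id` from the record itself, so that indexing the same record twice overwrites one document instead of creating a second one. The following is only a minimal sketch of that idea: the class name `DocumentIds` and the notion of a "business key" are assumptions for illustration, not part of the application described here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper: builds a stable document id from a business key.
public final class DocumentIds {
    private DocumentIds() {}

    /**
     * Returns the same id for the same business key on every call,
     * so repeated indexing of one record targets one document.
     * The business key is an assumption: any string that uniquely
     * identifies a record in the provider's data.
     */
    public static String fromBusinessKey(String businessKey) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha256.digest(businessKey.getBytes(StandardCharsets.UTF_8));
            // URL-safe Base64 without padding keeps the id compact.
            return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With RestHighLevelClient, such an id could be passed to `IndexRequest.id(...)` instead of letting Elasticsearch generate one. Whether this fits depends on whether the records carry a reliable unique key.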

Problem: Occasionally we have to remove all the data and load it again because there are duplicated records. I found that these duplicates differ only in their insert date, usually by a few hours.

Generally it works as expected, and detailed integration tests on org.codelibs.elasticsearch-cluster-runner are green.

Example metadata from an Elasticsearch query:

{
  "docs" : [
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "s0z-BHIBCvxpj4TjysIf",
      "_version" : 1,
      "_seq_no" : 17315835,
      "_primary_term" : 40,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589221706262,
        ...
      }
    },
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "jdVCBHIBXucoJmjM8emL",
      "_version" : 1,
      "_seq_no" : 17346529,
      "_primary_term" : 41,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589209395577,
        ...
      }
    }
  ]
}
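To put a number on the gap between the two documents above, the two `insertedDate` values can be subtracted. This sketch assumes the field holds epoch milliseconds, which the magnitude of the values suggests but the post does not state; the class name `InsertedDateGap` is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for comparing two insertedDate values.
public final class InsertedDateGap {
    private InsertedDateGap() {}

    /** Absolute gap between two timestamps given as epoch milliseconds (assumed unit). */
    public static Duration between(long firstEpochMillis, long secondEpochMillis) {
        return Duration.between(
                Instant.ofEpochMilli(firstEpochMillis),
                Instant.ofEpochMilli(secondEpochMillis)).abs();
    }

    public static void main(String[] args) {
        // Values copied from the two documents in the query result.
        Duration gap = between(1589221706262L, 1589209395577L);
        System.out.println(gap); // PT3H25M10.685S
    }
}
```

Under that assumption the two copies were inserted about 3 hours 25 minutes apart, which matches the "few hours" difference described.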

Tests

I loaded the data into a local ES instance many times: no duplicates.
I created a few long-running integration tests with a large number of inserts, updates, and queries against a local org.codelibs.elasticsearch-cluster-runner instance with 1 to 5 in-memory nodes: no duplicates.

Details: Elasticsearch version 7.5; connection to ES via org.elasticsearch.client.RestHighLevelClient.

Solution

The cause has been found: one of the nodes had trouble establishing a connection and occasionally disconnected.