We use a Spring Boot application to insert/update Elasticsearch documents. Our data provider sends us data via Kafka. Our app processes events, tries to find a matching record, and inserts the record if it does not exist or updates it if the received record differs from the saved one. There should not be any duplicated records in Elasticsearch.
The app inserts/updates documents with the IMMEDIATE refresh policy.
Problem: occasionally we have to remove all data and load it again because there are duplicated records. I found out that these cloned records differ only in insert date; it is usually a few hours' difference.
Generally it works as expected, and detailed integration tests on org.codelibs.elasticsearch-cluster-runner are green.
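The flow described above is a classic search-then-write sequence. Below is a minimal sketch of that decision logic, with an in-memory map standing in for the index so the example stays runnable (the real app would issue a search and an index request through RestHighLevelClient; all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class UpsertSketch {
    // Stand-in for the Elasticsearch index: business key -> stored payload.
    private final Map<String, String> index = new HashMap<>();

    /**
     * Insert the record if absent, update it only when the payload changed.
     * Returns true when a write happened.
     */
    public boolean upsert(String businessKey, String payload) {
        String existing = index.get(businessKey);   // "search" step
        if (existing == null) {
            index.put(businessKey, payload);        // insert
            return true;
        }
        if (!Objects.equals(existing, payload)) {
            index.put(businessKey, payload);        // update
            return true;
        }
        return false;                               // unchanged, skip write
    }

    public int size() {
        return index.size();
    }
}
```

The risk in the real system is that the search and the subsequent insert are not atomic: if two events for the same record are processed close together, or a write is retried after a connection problem, both searches can miss and both inserts go through with fresh auto-generated IDs.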
Example metadata from an Elasticsearch query:
{
  "docs" : [
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "s0z-BHIBCvxpj4TjysIf",
      "_version" : 1,
      "_seq_no" : 17315835,
      "_primary_term" : 40,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589221706262,
        ...
      }
    },
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "jdVCBHIBXucoJmjM8emL",
      "_version" : 1,
      "_seq_no" : 17346529,
      "_primary_term" : 41,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589209395577,
        ...
      }
    }
  ]
}
Tests: org.codelibs.elasticsearch-cluster-runner with 1 to 5 nodes in memory; no duplications.
Details:
Elasticsearch version: 7.5
ES connection via org.elasticsearch.client.RestHighLevelClient
The reason has been found: one of the nodes had trouble establishing a connection and occasionally disconnected.
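A common way to make such retries harmless is to index with an explicit, deterministic `_id` derived from the record's business key instead of letting Elasticsearch auto-generate one: a retried or duplicated write then overwrites the same document rather than creating a clone. A sketch of deriving such an ID (the key fields are hypothetical; the real call would pass the result to the index request's `id(...)` method):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DeterministicId {
    /**
     * Derive a stable document id from the record's business key, so
     * repeated indexing of the same record always targets one _id.
     */
    public static String docId(String provider, String externalId) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha.digest(
                (provider + "|" + externalId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

With this in place, a duplicate write bumps `_version` on the existing document instead of producing a second document with `_version: 1`, which is exactly the pattern visible in the metadata above.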