java spring spring-boot elasticsearch spring-data-elasticsearch

How to find and mark duplicates in Elasticsearch


I have two ES indices that contain data about people (name, birth_date, etc.). Some people are present in both indices, for example:

index1

_id     first_name  last_name  birth_date  ...
qqwew   demo        demo       1998.10.10
etroty  demo2       demo2      1995.11.11
werewr  demo3       demo3      1997.09.09

index2

_id     first_name  last_name  birth_date  ...
sdfll   demo514     demo514    2001.11.04
fdgdg   demo2       demo2      1995.11.11
sdfdfg  demo512     demo512    2000.05.16

As you can see, this entry is contained in both indices (compared by first_name, last_name & birth_date):

_id                first_name  last_name  birth_date  ...
(id is different)  demo2       demo2      1995.11.11

I need to find such entries and add an additional field containing a unique id, so index1 & index2 should look like this afterwards:

index1

_id     first_name  last_name  birth_date  unique_id
qqwew   demo        demo       1998.10.10  null
etroty  demo2       demo2      1995.11.11  QWERTY
werewr  demo3       demo3      1997.09.09  null

index2

_id     first_name  last_name  birth_date  unique_id
sdfll   demo514     demo514    2001.11.04  null
fdgdg   demo2       demo2      1995.11.11  QWERTY
sdfdfg  demo512     demo512    2000.05.16  null

My data comes as CSV files, which are parsed and imported into ES (via Java). I'm not sure at which stage I should do this, or whether it's even possible with ES.


Solution

  • For those wondering how I solved this: I did not. The best solution is hashing, but it does not completely suit my needs.
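
For readers who want to try the hashing idea anyway, here is a minimal sketch of how it might look at the CSV import stage, before documents are sent to ES. It is not the approach the author ended up using. The class and method names (DuplicateMarker, uniqueId, sharedIds) are hypothetical, each row is assumed to be a String[] of {first_name, last_name, birth_date}, and the normalization (trim plus lowercase) is an assumption about what should count as "the same person".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Elasticsearch or Spring Data Elasticsearch.
final class DuplicateMarker {

    // Deterministic id: SHA-256 over the three comparison fields, hex-encoded.
    // Equal (normalized) field values always produce the same id.
    static String uniqueId(String firstName, String lastName, String birthDate) {
        // The unit separator keeps ("ab", "c") distinct from ("a", "bc").
        String joined = String.join("\u001F",
                normalize(firstName), normalize(lastName), normalize(birthDate));
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Assumed normalization: ignore surrounding whitespace and letter case.
    private static String normalize(String value) {
        return value == null ? "" : value.trim().toLowerCase(Locale.ROOT);
    }

    // Ids that occur in both row sets; each row is {first_name, last_name, birth_date}.
    static Set<String> sharedIds(List<String[]> index1Rows, List<String[]> index2Rows) {
        Set<String> first = new HashSet<>();
        for (String[] row : index1Rows) {
            first.add(uniqueId(row[0], row[1], row[2]));
        }
        Set<String> shared = new HashSet<>();
        for (String[] row : index2Rows) {
            String id = uniqueId(row[0], row[1], row[2]);
            if (first.contains(id)) {
                shared.add(id);
            }
        }
        return shared;
    }
}
```

During import, a row whose uniqueId is in the set returned by sharedIds would get that value as its unique_id field, and every other row would keep unique_id as null. One limitation of this sketch is that it only matches exact values after normalization, so typos or differently formatted dates are not detected as duplicates.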