I have a large ES index that I intend to populate from various sources. The sources sometimes contain the same documents, so I will end up with duplicate docs that differ only by the 'source' param.
To perform de-duplication when serving searches, I see 2 ways:
I prefer not to filter at the Python level, so that pagination is preserved. Is there a way to tell Elasticsearch to filter by priority based on some value in the document (in my case, source)?
I want to filter by simple priority (so if my order is A,B,C, I will serve the A document if it exists, then B if doc from source A doesn't exist, followed by C).
An example set of duplicate docs would look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
},
{
"id": 1,
"source": "B",
"rest_of": "data",
...
},
{
"id": 1,
"source": "C",
"rest_of": "data",
...
}
But since I want to serve "A" first, then "B" if there is no "A", then "C" if there is no "B", the search result for "id": 1 should look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
}
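To pin down the rule, here is a minimal plain-Python sketch of the selection described above. It only illustrates the semantics (it is the client-side filtering I would rather avoid, not an Elasticsearch feature). The names `SOURCE_PRIORITY` and `pick_preferred` are made up for this example, and unknown sources are assumed to rank last.

```python
# Hypothetical priority order: earlier entries win over later ones.
SOURCE_PRIORITY = ["A", "B", "C"]


def pick_preferred(docs, priority=SOURCE_PRIORITY):
    """Keep one document per "id", choosing the highest-priority source."""
    rank = {source: position for position, source in enumerate(priority)}
    unknown = len(priority)  # assumption: sources not listed rank last
    best = {}
    for doc in docs:
        current = best.get(doc["id"])
        if current is None or (
            rank.get(doc["source"], unknown) < rank.get(current["source"], unknown)
        ):
            best[doc["id"]] = doc
    # Dicts keep insertion order, so ids come back in first-seen order.
    return list(best.values())


hits = [
    {"id": 1, "source": "B", "rest_of": "data"},
    {"id": 1, "source": "C", "rest_of": "data"},
    {"id": 1, "source": "A", "rest_of": "data"},
    {"id": 2, "source": "C", "rest_of": "data"},
]
deduplicated = pick_preferred(hits)
```

With these hits, id 1 resolves to the document from source A, and id 2 falls back to source C because it is the only copy.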
Note: Alternatively, I could try to de-duplicate at the population phase, but I'm worried about the performance. Willing to explore this if there's no trivial way to implement solution 1.
I think the best solution is to avoid having duplicates in your index in the first place. I don't know how frequent they will be in your data, but if you have a lot of them, they will skew the term frequencies and may lead to poor search relevance.
A fairly simple approach is to generate the Elasticsearch ID of the document yourself, using a method that is consistent across all sources. You can set the _id explicitly when indexing instead of letting ES generate it for you.
What happens then is that the last source to arrive overwrites the existing document, if there is one: last to come wins. If you don't care about the source, this may work.
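As a rough sketch of that idea, the snippet below derives a stable _id from the fields that identify a document, deliberately leaving out 'source', so copies from different sources map to the same _id. The helper names, the choice of SHA-1, the `key_fields` parameter and the index name `my-index` are assumptions for illustration. The resulting dicts use the `_index` / `_id` / `_source` action shape that the Python client's bulk helper accepts, but nothing here talks to a cluster.

```python
import hashlib
import json


def make_doc_id(doc, key_fields=("id",)):
    """Build a deterministic _id from the identifying fields only."""
    # Serialise the key fields in a fixed order so every source hashes alike.
    key = json.dumps([doc[field] for field in key_fields])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def to_bulk_action(doc, index_name="my-index"):
    """Wrap a document as a bulk-style action with an explicit _id."""
    return {"_index": index_name, "_id": make_doc_id(doc), "_source": doc}


doc_a = {"id": 1, "source": "A", "rest_of": "data"}
doc_b = {"id": 1, "source": "B", "rest_of": "data"}

action_a = to_bulk_action(doc_a)
action_b = to_bulk_action(doc_b)
# Both actions carry the same _id, so indexing doc_b after doc_a
# overwrites it rather than creating a second document.
```

If the natural key is already a short, unique string, using it directly as the _id avoids the hashing step; the hash is only useful when the key spans several fields or is long.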
However, this comes with a little performance cost, as stated in this article:
As you have seen in this blog post, it is possible to prevent duplicates in Elasticsearch by specifying a document identifier externally prior to indexing data into Elasticsearch. The type and structure of the identifier can have a significant impact on indexing performance. This will however vary from use case to use case so it is recommended to benchmark to identify what is optimal for you and your particular scenario.