java lucene nosql elasticsearch

Efficiency of updating documents in elasticsearch


I have been using Elasticsearch successfully for about a year, loading millions of documents and running various queries and facets against the data.

Some of my users have recently asked whether it is possible to 'mark documents as read' so that those documents can be excluded from search results.

I have implemented this without issue, but now I'm wondering whether I have chosen the best implementation. My understanding is that updating a document in ES (or any Lucene-based implementation) is in effect the same as deleting and re-indexing it.
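To make the delete-and-re-index behaviour concrete, here is a toy Python model of an append-only store. It is only a sketch, not Lucene's actual code: `ToySegmentStore` and its methods are hypothetical names, used to illustrate why flipping a single field means writing the whole document again.

```python
class ToySegmentStore:
    """Toy append-only store: documents are never modified in place."""

    def __init__(self):
        self.entries = []      # every version ever written, in write order
        self.live = {}         # doc_id -> position of the current version
        self.deleted = set()   # positions of superseded versions (tombstones)

    def index(self, doc_id, doc):
        # If the id already exists, the old version is only marked as deleted.
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])
        self.entries.append((doc_id, dict(doc)))
        self.live[doc_id] = len(self.entries) - 1

    def update(self, doc_id, changes):
        # An "update" reads the current version, merges the change,
        # and writes the full document again.
        _, current = self.entries[self.live[doc_id]]
        self.index(doc_id, {**current, **changes})

    def get(self, doc_id):
        return self.entries[self.live[doc_id]][1]


store = ToySegmentStore()
store.index("1", {"title": "hello", "read": False})
store.update("1", {"read": True})

print(len(store.entries))   # both versions are still stored
print(len(store.deleted))   # the old version is only tombstoned
print(store.get("1"))
```

In the real engine, the tombstoned versions are reclaimed later when segments are merged, so frequent updates cost both the re-index and some extra merge work.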

My question to the Lucene/ES community: will there be any negative impact from updating documents as a user-driven, ad hoc task? (And can you suggest an alternative?)

Thanks, JayTee


Solution

  • Yes, there will be a performance overhead for re-indexing. This is described as "non-negligible" at https://www.elastic.co/blog/managing-relations-inside-elasticsearch (there it's talking about nested docs, but updating a field on a normal doc, one without nested fields, is the same):

    "If your data changes often, nested documents can have a non-negligible overhead associated with reindexing."

    An alternative is given later in that article, namely Parent/Child:

    "Parent/Child removes this limitation by separating the two documents and only loosely coupling them... means you are more free to update/delete children docs, since they have no effect on the parent or other children.

    The downside is ...(queries).. aren’t quite as fast .. since they are not colocated in the same Lucene block."

    So if every doc you have will eventually be updated to "read", that will involve the overhead of re-indexing your entire datastore. If that's going to happen slowly over time, maybe your architecture can handle it.

    If you are concerned that a high number of docs could be marked as read, and that this will create a large load on your system, you can use a parent/child relationship for the read field. But there will be (as I understand it, minor) extra overhead to run the query "only give docs where the child field 'read' is false".
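As a rough illustration of the parent/child alternative, the sketch below builds the JSON request bodies in Python without contacting a cluster. It makes several assumptions that are not in the original answer: a recent Elasticsearch version where parent/child is modelled with a `join` field (older versions used a `_parent` mapping instead), a small child document per user rather than a boolean `read` field, and the hypothetical names `doc_relation`, `item`, `read_marker` and `user_id`.

```python
import json

# Mapping: one join field relating a parent "item" to child "read_marker" docs.
# All field and relation names here are hypothetical.
mapping = {
    "mappings": {
        "properties": {
            "doc_relation": {
                "type": "join",
                "relations": {"item": "read_marker"},
            },
            "user_id": {"type": "keyword"},
        }
    }
}


def read_marker(parent_id, user_id):
    """Body of a small child document recording that user_id read parent_id."""
    return {
        "user_id": user_id,
        "doc_relation": {"name": "read_marker", "parent": parent_id},
    }


def unread_query(user_id, text_query):
    """Search body: match text_query, excluding parents this user has read."""
    return {
        "query": {
            "bool": {
                "must": [text_query],
                "must_not": [
                    {
                        "has_child": {
                            "type": "read_marker",
                            "query": {"term": {"user_id": user_id}},
                        }
                    }
                ],
            }
        }
    }


body = unread_query("jaytee", {"match": {"title": "lucene"}})
print(json.dumps(body, indent=2))
```

With this layout, marking a document as read only indexes a tiny child document and leaves the large parent untouched. The child has to be indexed with routing set to the parent's id so that both land on the same shard, and the `has_child` clause is where the extra query cost mentioned above comes from.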