elasticsearch, elasticsearch-bulk-api

Update nested field for millions of documents


I use bulk updates with a script to update a nested field, but this is very slow:

POST index/type/_bulk

{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}

 ... [many more, split into several batches]

Is there another way that would be faster?

It seems possible to store the script so that it is not repeated for each update, but I couldn't find a way to keep the params "dynamic".


Solution

  • As is often the case with performance optimization questions, there is no single answer, since there are many possible causes of poor performance.

    In your case you are making bulk update requests. When an update is performed, the document is actually being re-indexed:

    ... to update a document is to retrieve it, change it, and then reindex the whole document.

    Hence it makes sense to look at the indexing performance tuning tips. The first things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
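
    As a rough illustration of those three knobs, here is a minimal Python sketch using only the standard library. It is not a definitive implementation: the names chunked, build_bulk_body and send_all are made up for this example, and send is a hypothetical callable that you would write yourself to POST one body to the _bulk endpoint. The script key is "inline" as in the question; newer Elasticsearch releases call it "source". The two settings dictionaries are what you would PUT to the index's _settings endpoint before and after the load.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

SCRIPT = "ctx._source.nestedfield.add(params.nestedfield)"

# Bodies for PUT /<index>/_settings: switch refresh off during the load,
# then restore it afterwards ("1s" is the Elasticsearch default).
DISABLE_REFRESH = {"index": {"refresh_interval": "-1"}}
RESTORE_REFRESH = {"index": {"refresh_interval": "1s"}}


def chunked(items, size):
    """Yield lists of at most `size` items; `size` is the bulk size to tune."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_bulk_body(batch):
    """Turn (doc_id, nested_value) pairs into one newline-delimited bulk body."""
    lines = []
    for doc_id, nested in batch:
        lines.append(json.dumps({"update": {"_id": doc_id}}))
        lines.append(json.dumps(
            {"script": {"inline": SCRIPT, "params": {"nestedfield": nested}}}
        ))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_all(bodies, send, workers=4):
    """Send bulk bodies from several threads; `send` is supplied by the caller."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, bodies))


updates = [
    ("1", {"field1": "1", "field2": "2"}),
    ("2", {"field1": "3", "field2": "4"}),
]
# A bulk size of 1 only to keep the example small; real batches are far larger.
bodies = [build_bulk_body(batch) for batch in chunked(updates, 1)]
```

    The right values for the bulk size and the worker count depend on document size and cluster capacity, so they are best found by measuring rather than guessed.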

    You might also consider using a ready-made client that supports parallel bulk requests, as the Python elasticsearch client does.
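
    With the Python client, this could look roughly like the sketch below, which uses the parallel_bulk helper from elasticsearch.helpers. Several things here are assumptions: the URL, the index name "index", the thread count and the chunk size are placeholders, generate_actions and run are names invented for this example, and the "inline" key follows the question (newer Elasticsearch releases expect "source"). A cluster that still uses mapping types would also need a "_type" entry in each action.

```python
SCRIPT = "ctx._source.nestedfield.add(params.nestedfield)"


def generate_actions(updates, index="index"):
    """Yield one scripted-update action per (doc_id, nested_value) pair."""
    for doc_id, nested in updates:
        yield {
            "_op_type": "update",
            "_index": index,  # placeholder index name
            "_id": doc_id,
            # Keys other than the metadata above become the update body.
            "script": {"inline": SCRIPT, "params": {"nestedfield": nested}},
        }


def run(updates, url="http://localhost:9200"):
    """Send all updates with several threads and return the failure count."""
    # Imported here so generate_actions can be tried without the client installed.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    client = Elasticsearch(url)  # placeholder URL
    failed = 0
    # parallel_bulk is lazy: nothing is sent until the generator is consumed.
    for ok, _info in parallel_bulk(
        client,
        generate_actions(updates),
        thread_count=4,    # assumed value, tune for your cluster
        chunk_size=1000,   # assumed value, this is the bulk size
        raise_on_error=False,
    ):
        if not ok:
            failed += 1
    return failed
```

    Because the params travel with each action, they stay different per document even though the script text is identical.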

    Ideally, you would monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks yield an actual gain. Here is an overview blog post about Elasticsearch performance metrics.
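
    One simple way to check whether a tweak helps is to compare the indexing counters from the node stats API before and after a batch. The sketch below is only an illustration under assumptions: fetch_indexing_stats, indexing_totals and ms_per_doc are names invented here, and the base URL is whatever your cluster listens on. It reads the index_total and index_time_in_millis counters of every node and derives the average indexing time per document between two snapshots.

```python
import json
from urllib.request import urlopen


def fetch_indexing_stats(base_url):
    """Fetch the per-node indexing stats; base_url is e.g. http://localhost:9200."""
    with urlopen(base_url + "/_nodes/stats/indices/indexing") as resp:
        return json.load(resp)


def indexing_totals(stats):
    """Sum the indexed-document count and indexing time over all nodes."""
    docs = millis = 0
    for node in stats["nodes"].values():
        indexing = node["indices"]["indexing"]
        docs += indexing["index_total"]
        millis += indexing["index_time_in_millis"]
    return docs, millis


def ms_per_doc(before, after):
    """Average indexing milliseconds per document between two snapshots."""
    docs0, ms0 = indexing_totals(before)
    docs1, ms1 = indexing_totals(after)
    delta = docs1 - docs0
    return (ms1 - ms0) / delta if delta else 0.0
```

    Taking one snapshot before a bulk run and one after it, then comparing ms_per_doc across different bulk sizes or thread counts, gives a rough signal of which setting actually improved throughput.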