elasticsearch, nest, ironworker

Updating a field in millions of documents from a worker


I currently have to update a field in over 1 million documents indexed in Elasticsearch. This is a complex task because the field contains metadata generated from XML files by evaluating XPath expressions. We have to loop over all the documents in the index and update this field. To avoid overloading the system, we decided to use the IronWorker platform.
I have read several posts about how to update millions of docs in Elasticsearch, like this one, but since we are going to use IronWorker there are some restrictions, such as a task only being allowed to run for 60 minutes.

Question: How do I loop over all the documents and update their fields, given the 60-minute restriction?
I thought about opening a scroll and passing the scroll_id to the next worker, but I have no idea how long it will take for the next task to start, so the scroll could expire and I would have to start all over.


Solution

  • From your description, it sounds like you could chain IronWorker tasks together, which is actually very easy. If you have some idea of how long it takes to update a single item, you can extrapolate how much time you need. Say it takes 100 ms to update an item: then you can do 10 per second, or 600 per minute, so maybe do 6000 (which should take about 10 minutes) and then queue up the next task from your code. Queuing up the next task is just as easy as queuing up the first one: http://dev.iron.io/worker/reference/api/#queue_a_task (you can use the client library for your language too).

    Or just stop after X minutes and queue up the next worker.

    Or, if you want to make things faster, how about queuing up 26 tasks at the same time, one for each letter of the alphabet? Each one can query for all the items starting with the letter it is assigned to (Prefix Query).

    There are many ways to slice this problem.
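
To make the chaining idea a bit more concrete, here is a minimal Python sketch (Python is used only for illustration, not because the question uses it). It is an assumption-heavy outline, not IronWorker or Elasticsearch client code: `fetch_batch`, `update_item` and `queue_next` are hypothetical callables you would implement with your own client libraries. Instead of handing a scroll_id to the next task, it hands over the last processed document key, so nothing expires while the next task waits in the queue. This only works if you can query "documents after key X" in a stable sort order (for example with something like Elasticsearch's `search_after`), which is an assumption about your setup.

```python
import string
import time
from typing import Callable, List, Optional


def batch_size_for(per_item_ms: int, budget_minutes: int) -> int:
    """Number of items that fit in the budget at the measured per-item cost."""
    return (budget_minutes * 60 * 1000) // per_item_ms


def run_slice(
    fetch_batch: Callable[[Optional[str]], List[str]],  # hypothetical: ids after cursor
    update_item: Callable[[str], None],                 # hypothetical: update one doc
    queue_next: Callable[[str], None],                  # hypothetical: queue next task
    start_after: Optional[str] = None,
    budget_seconds: float = 50 * 60,                    # headroom below the 60 min limit
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Process batches until the time budget is used up, then chain the next task.

    Returns the cursor handed to the next task, or None when nothing is left.
    The budget is only checked between batches, so keep batches small.
    """
    deadline = clock() + budget_seconds
    cursor = start_after
    while clock() < deadline:
        batch = fetch_batch(cursor)
        if not batch:
            return None  # every document has been processed
        for doc_id in batch:
            update_item(doc_id)
            cursor = doc_id  # remember the last document that was updated
    if cursor is None:
        # Budget was exhausted before any work was done; nothing to hand over.
        raise RuntimeError("time budget too small to process a single batch")
    queue_next(cursor)  # the next task resumes right after this document
    return cursor


def prefix_partitions() -> List[str]:
    """One prefix per parallel task; assumes keys start with a lowercase ASCII letter."""
    return list(string.ascii_lowercase)
```

With the numbers from above, `batch_size_for(per_item_ms=100, budget_minutes=10)` gives 6000. For the parallel variant you would queue one task per entry of `prefix_partitions()` and have each task's `fetch_batch` restrict its query to that prefix; keys starting with digits or uppercase letters would need extra partitions.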