multithreading, elasticsearch, elasticsearch-py

ElasticSearch Scroll API with multi threading


I already understand the basics of how the Elasticsearch Scroll API works. You first call the search method with a scroll value such as 1m; it returns a _scroll_id, which is then passed to consecutive scroll calls in a loop until all of the documents have been returned. My problem is that I want to run this process across multiple threads rather than serially. For example:

If I have 300000 documents, I want to fetch and process them like this:

  • The 1st thread will process the first 100000 documents
  • The 2nd thread will process the next 100000 documents
  • The 3rd thread will process the remaining 100000 documents

So my question is: since I could not find any way to set the from value on the Scroll API, how can I speed up scrolling with threads instead of processing the documents serially?

My sample Python code:

if index_name is not None and doc_type is not None and body is not None:
   es = init_es()
   page = es.search(index=index_name, doc_type=doc_type, scroll='30s', size=10, body=body)
   sid = page['_scroll_id']
   # Number of hits on the first page; the loop stops when a page comes back empty
   scroll_size = len(page['hits']['hits'])

   # Start scrolling
   while (scroll_size > 0):

       print("Scrolling...")
       page = es.scroll(scroll_id=sid, scroll='30s')
       # Update the scroll ID
       sid = page['_scroll_id']

       print("scroll id: " + sid)

       # Get the number of results that we returned in the last scroll
       scroll_size = len(page['hits']['hits'])
       print("scroll size: " + str(scroll_size))

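       print("scrolled data :")
       # Aggregations are only returned with the initial search response,
       # so print the hits of this page instead
       print(page['hits']['hits'])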

Solution

  • A single scroll must be consumed sequentially; that is how it is designed.

    You can still use multiple threads, though: parallelism is exactly what Elasticsearch is good at.

    An Elasticsearch index is composed of shards, which are the physical storage of your data. Shards can live on the same node or, better, on different nodes.

    In addition, the search API offers a very useful option, preference (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html), which lets you restrict a search to specific shards.

    So, back to your app:

    1. Get the list of the index's shards (and nodes)
    2. Create one thread per shard
    3. Run a scroll search in each thread, restricted to that thread's shard

    Et voilà!
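
    The three steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested production implementation: the helper names scroll_shard and parallel_scroll are made up for this example, es is assumed to be an elasticsearch-py client whose search call still accepts body (as in the question's code), and the shard count is assumed to be the number of entries under "shards" in the search_shards response. Each thread restricts its scroll to one shard with preference="_shards:N".

```python
from concurrent.futures import ThreadPoolExecutor


def scroll_shard(es, index, body, shard, scroll="1m", size=1000):
    """Scroll through a single shard and return all of its hits."""
    hits = []
    # preference="_shards:N" restricts this search to shard N only
    page = es.search(index=index, body=body, scroll=scroll, size=size,
                     preference="_shards:%d" % shard)
    # Within one shard the scroll is still consumed sequentially
    while page["hits"]["hits"]:
        hits.extend(page["hits"]["hits"])
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=scroll)
    # Release the server-side scroll context once the shard is exhausted
    es.clear_scroll(scroll_id=page["_scroll_id"])
    return hits


def parallel_scroll(es, index, body):
    """Run one scroll per shard, each in its own thread (hypothetical helper)."""
    # Assumption: search_shards returns one group of copies per shard
    shard_count = len(es.search_shards(index=index)["shards"])
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        futures = [pool.submit(scroll_shard, es, index, body, shard)
                   for shard in range(shard_count)]
        # Concatenate the per-shard results in shard order
        return [hit for future in futures for hit in future.result()]
```

    With a client from the question's init_es(), this would be called as parallel_scroll(es, index_name, body). Collecting every hit in memory is only reasonable for modest result sets; for something like 300000 documents it is probably better to process each page inside scroll_shard instead of accumulating it.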

    Alternatively, you could use the Elasticsearch for Apache Hadoop (elasticsearch-hadoop) connector, which does exactly that for Spark / Pig / MapReduce / Hive.