elasticsearch, elasticsearch-py

Really huge query or optimizing an elasticsearch update


I'm working on document visualization for binary classification of a large set of documents (around 150,000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top 20 topics in positively classified documents, and then the same for the negatives.

I created a Python script that downloads the data from Elasticsearch and classifies the docs, BUT the problem is that the predictions are not stored back in Elasticsearch, so I cannot ask for the top 20 topics in a given category. My first idea was to create a query in Elasticsearch asking for the aggregations and passing a match clause per document.

As I have the IDs of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document IDs to select, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 IDs like:

"query": {
    "bool": {
      "should": [
           {"match": {"id_str": "939490553510748161"}},
           {"match": {"id_str": "939496983510742348"}}
           ...
        ],
      "minimum_should_match" : 1
    }
},
"aggs" : { ... }

So I tried to store the predicted categories in the Elasticsearch index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:

for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )

Do you know an alternative to update the predictions faster?


Solution

  • You could use a bulk query, which lets you serialize your requests and hit Elasticsearch only once while executing many operations. Try:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("myurl")
    list_ids = ["1", "2", "3"]

    # build one update action per document id
    query_list = []
    for id in list_ids:
        query_dict = {
            '_op_type': 'update',
            '_index': kwargs["index"],
            '_type': kwargs["doc_type"],
            '_id': id,
            'doc': {"prediction": kwargs["category"]}
        }
        query_list.append(query_dict)

    # send all updates in a single bulk request
    helpers.bulk(client=es, actions=query_list)
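    If materializing ~50,000 action dicts in memory is a concern, `helpers.bulk` also accepts a generator, and `chunk_size` can be tuned. A sketch along those lines (same hypothetical `kwargs` values as above):

```python
from typing import Iterable, Iterator

def update_actions(ids: Iterable[str], index: str, doc_type: str,
                   category: str) -> Iterator[dict]:
    """Yield one bulk 'update' action per document id."""
    for _id in ids:
        yield {
            "_op_type": "update",
            "_index": index,
            "_type": doc_type,
            "_id": _id,
            "doc": {"prediction": category},
        }

# helpers.bulk(es, update_actions(docs_ids, kwargs["index"],
#              kwargs["doc_type"], kwargs["category"]),
#              chunk_size=1000)  # tune chunk_size for your cluster
```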
    

    Please have a read here. Regarding querying the list of IDs: to get a faster response you shouldn't match on the id_str value, as you did in the question, but use the _id field instead. That lets you use a multi-get (mget) query, the bulk equivalent of the get operation. Here it is in the Python library. Try:

    my_ids_list = [<some_ids_here>]
    es.mget(index=kwargs["index"],
            doc_type=kwargs["doc_type"],
            body={'ids': my_ids_list})
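    The mget response is a dict with a "docs" list, one entry per requested ID, each carrying a "found" flag. A small sketch of pulling out the sources, assuming that standard response shape:

```python
def extract_sources(mget_response):
    """Return the _source of every document that was found."""
    return [doc["_source"]
            for doc in mget_response["docs"]
            if doc.get("found")]
```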