Tags: elasticsearch, bigdata, knn, morelikethis

Speeding up elasticsearch more_like_this query


I am interested in fetching documents similar to a given input document (similar to kNN). Vectorizing documents of very different sizes with doc2vec would give inconsistent document vectors. The user's input may be just a few terms or sentences, whereas each document the doc2vec model was trained on consists of hundreds or thousands of words, so a vector computed for that input would lack features, and a k-nearest-neighbours search over it would produce incorrect results.

Hence, I went ahead with the more_like_this query, which does a similar job to kNN irrespective of the size of the user's input, since I'm interested in analyzing only text fields.

But I am concerned about performance once I have millions of documents indexed in Elasticsearch. The documentation says that using term_vector to store term vectors at index time can speed up the analysis. What I don't understand is which type of term vector the documentation refers to in this context, as there are three different types: term information, term statistics, and field statistics. Term statistics and field statistics compute the frequency of terms with respect to other documents in the index, so wouldn't these vectors be outdated when I introduce new documents into the index? Hence I presume that the more_like_this documentation refers to term information (the information about the terms in one particular document, irrespective of the others).

Can anyone let me know if computing only the term information vector at index time is sufficient to speed up more_like_this?


Solution

  • There is no need to worry about term vectors becoming outdated: they are stored per document, so they are updated whenever the corresponding document is indexed or updated.

    For More Like This it is enough to set term_vector to yes; you don't need offsets and positions. So, if you don't plan to use highlighting, the plain yes setting should be fine.

    So, for your text field, a mapping like this is enough to speed up MLT execution:

    {
      "mappings": {
        "properties": {
          "text": {
            "type":        "text",
            "term_vector": "yes"
          }
        }
      }
    }
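
To show how the mapping above would be used, here is a minimal sketch of two more_like_this search bodies built as plain Python dicts: one for free-text user input and one that points at an already indexed document. The index name `my-index`, the document id `1`, and the helper function names are hypothetical; only the field name `text` comes from the mapping. As far as I understand the documentation, stored term vectors mainly help in the second case, where Elasticsearch would otherwise have to re-analyze the document's text. Lowering `min_term_freq` to 1 is my own suggestion for short inputs, not something the answer prescribes.

```python
import json

# The field name matches the mapping above. The index name and document id
# used in the demo at the bottom are hypothetical.
FIELD = "text"


def mlt_from_text(user_input, field=FIELD):
    """Build a more_like_this search body for free-text input."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": user_input,
                # Short inputs rarely repeat a term, so lower the default
                # min_term_freq of 2 to keep terms that occur only once.
                "min_term_freq": 1,
                # 25 is the documented default, written out for clarity.
                "max_query_terms": 25,
            }
        }
    }


def mlt_from_document(index, doc_id, field=FIELD):
    """Build a more_like_this search body that uses an indexed document as the example."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                # Reference the document by index and id instead of raw text.
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        }
    }


if __name__ == "__main__":
    # Print both request bodies as JSON, ready to send to the _search endpoint.
    print(json.dumps(mlt_from_text("speeding up similarity search"), indent=2))
    print(json.dumps(mlt_from_document("my-index", "1"), indent=2))
```

Either body can be sent to the index's `_search` endpoint with whatever HTTP or Elasticsearch client you already use. The sketch deliberately stops at building the JSON so that it does not depend on a particular client version.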