Elasticsearch: How to store term vectors


I am working on a project where I heavily use Elasticsearch and leverage the moreLikeThis query to implement some features. The official documentation for the MLT query states the following:

In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.

This is from the **How it works** section. The idea, then, is to tune the mapping so that it stores the precalculated term vectors. The problem is that the documentation does not make clear how exactly this should be done. On the one hand, the MLT documentation provides an example mapping that looks like this:

curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'

On the other hand, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:

curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
    ....

This should create an index that stores term vectors together with positions, offsets, and payloads.

Now the question is: which of the two mappings should be used? Is this a flaw in the documentation, or am I missing something?


Solution

  • You are right: this does not seem to be explicitly mentioned in the current version of the documentation. However, the documentation for the upcoming 2.0 release has a more detailed explanation:

    Term vectors contain information about the terms produced by the analysis process, including:

    • a list of terms.
    • the position (or order) of each term.
    • the start and end character offsets mapping the term to its origin in the original string.

    These term vectors can be stored so that they can be retrieved for a particular document.

    The term_vector setting accepts:

    • no: No term vectors are stored (default).
    • yes: Just the terms in the field are stored.
    • with_positions: Terms and positions are stored.
    • with_offsets: Terms and character offsets are stored.
    • with_positions_offsets: Terms, positions, and character offsets are stored.
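
As a rough illustration of how the setting plugs into a mapping, here is a minimal sketch in the same curl style as the question. It assumes that plain term storage ("yes") is enough for moreLikeThis, as the MLT documentation's own example suggests. The `build_mapping` helper is hypothetical, and the index, type, and field names (`imdb`, `movies`, `title`) are simply borrowed from the question. It accepts the values listed above plus `with_positions_offsets_payloads`, which appears in the Term Vectors example.

```shell
#!/bin/sh
# Hypothetical helper: prints a mapping body (string-typed, as in the question)
# for a single field, using the term_vector setting passed as the first argument.
build_mapping() {
  term_vector="$1"
  case "$term_vector" in
    # Values from the list above, plus the one used in the Term Vectors example.
    no|yes|with_positions|with_offsets|with_positions_offsets|with_positions_offsets_payloads) ;;
    *) echo "unsupported term_vector value: $term_vector" >&2; return 1 ;;
  esac
  cat <<EOF
{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "$term_vector"
        }
      }
    }
  }
}
EOF
}

# Print the body so it can be inspected before anything is sent to a cluster.
build_mapping yes

# To create the index, the body could be piped to curl as in the question
# (assumes a node listening on localhost:9200):
# build_mapping yes | curl -s -XPUT 'http://localhost:9200/imdb/' -d @-
```

Richer settings such as `with_positions_offsets` store more per term and therefore use more disk, so it seems reasonable to start with `yes` and only move up if a feature needs positions or offsets.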