Tags: elasticsearch, analyzer, stop-words

How to store what is generated by the analyser?


Let's say that I use this mapping:

PUT test
{
  "settings" : {
    "index" : {
        "number_of_shards" : 1, 
        "number_of_replicas" : 0
    }
  },
  "mappings": {
    "testtype": {
      "properties": {
        "content": {
          "type":     "text",
          "analyzer": "english",
          "store": true
        }
      }
    }
  }
}

Now I can index a document:

PUT test/testtype/0
{
   "content": "The Quick brown Fox"
}

And I can retrieve it:

GET test/testtype/0

Which returns:

{
  "_index": "test",
  "_type": "testtype",
  "_id": "0",
  "_version": 1,
  "found": true,
  "_source": {
    "content": "The Quick brown Fox"
  }
}

I know that the _source field is only supposed to contain the document exactly as I inserted it, which is why I specified in my mapping that I want to store my content field. By querying the stored field, I expected to get back what the analyser generates, something like this:

"quick brown fox"

But when I query the stored field:

GET test/testtype/_search
{
  "stored_fields": "content" 
}

I get back exactly what I wrote in my document:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "testtype",
        "_id": "0",
        "_score": 1,
        "fields": {
          "content": [
            "The Quick brown Fox"
          ]
        }
      }
    ]
  }
}

So my question is: how can I store the output of my analyser in Elasticsearch?


Solution

  • Your question is about the difference between the stored text and the generated tokens, which comes down to the store attribute of a Lucene field.

    A stored field contains exactly the same value as the corresponding field in the "_source" JSON; the analyser is not applied to it.

    The generated tokens are kept in the index in an internal Lucene representation. You can use the _analyze or _termvectors endpoint to see the tokens, or you can use a terms aggregation (on a text field this requires fielddata to be enabled).
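
As a rough sketch of those three options, here are console requests using the index, type and field names from the question. The aggregation name content_terms is made up for illustration, and the URL forms assume a version of Elasticsearch that still has mapping types, as the question's mapping does.

```
# 1. Run the field's analyser on a piece of text and list the resulting tokens
GET test/_analyze
{
  "field": "content",
  "text": "The Quick brown Fox"
}

# 2. Show the terms that were indexed for the content field of document 0
GET test/testtype/0/_termvectors?fields=content

# 3. Bucket the indexed terms across all documents.
#    Assumption: fielddata has to be enabled on the text field first,
#    repeating the existing field settings in the mapping update.
PUT test/_mapping/testtype
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "english",
      "store": true,
      "fielddata": true
    }
  }
}

GET test/testtype/_search
{
  "size": 0,
  "aggs": {
    "content_terms": {
      "terms": { "field": "content" }
    }
  }
}
```

None of these requests change what is stored; they only read the tokens back. Enabling fielddata can use a significant amount of heap, so the first two are usually the lighter way to inspect the analyser output.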