Terms get truncated after indexing document (Elasticsearch)


I'm new to Elasticsearch, and so far all I have done is index some documents. When I retrieved the term vectors, I noticed that quite a few terms are truncated. Here is a small example:

        "nationallypublic": {
           "term_freq": 1,
           "tokens": [
              {
                 "position": 496,
                 "start_offset": 3126,
                 "end_offset": 3146
              }
           ]
        },
        "natur": {
           "term_freq": 1,
           "tokens": [
              {
                 "position": 60,
                 "start_offset": 373,
                 "end_offset": 380
              }
           ]
        },

These are some excerpts from the document. This one contains natural, which was indexed as natur:

are some of the filmmakers ofthe 80s Its natural said Robert Friedman the senior vicepresident of worldwide advertising and publicity at Warner Bros

And another one for nationallypublic, which should have been nationallypublicized (I know this is not a real word, but even so it should be indexed in full):

They were reported missing on June 21 several hours after beingstopped for speeding near Philadelphia Miss After a nationallypublicized search their bodies were discovered Aug 4 on a farmjust outside the town

Did I do something wrong? Here are my settings and mappings:

{
   "ap1": {
      "mappings": {
         "document": {
            "properties": {
               "docno": {
                  "type": "string",
                  "index": "not_analyzed",
                  "store": true
               },
               "text": {
                  "type": "string",
                  "store": true,
                  "term_vector": "with_positions_offsets_payloads",
                  "analyzer": "my_english"
               }
            }
         }
      },
      "settings": {
         "index": {
            "creation_date": "1422144472984",
            "uuid": "QzT_sx4aRWOXGlEs2ATibw",
            "analysis": {
               "analyzer": {
                  "my_english": {
                     "type": "english",
                     "stopwords": "_none_"
                  }
               }
            },
            "store": {
               "type": "default"
            },
            "number_of_replicas": "0",
            "number_of_shards": "1",
            "version": {
               "created": "1040299"
            }
         }
      }
   }
}

Solution

  • This is the effect of the stemmer. The english analyzer you configured includes an algorithmic stemmer (of the Porter/Snowball family) by default. The expected behaviour of a stemmer is to reduce words to their base form, like below:

    Jumping => jump
    Running => run
    

    And so on. An algorithmic stemmer applies suffix-stripping rules rather than looking words up in a dictionary. The result therefore represents the base form, but it is not always an actual word. So effectively something like the following conversion happens at both index time and search time:

    natural   => natur
    naturally => natur
    nature    => natur
    

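    Because related forms collapse to the same term, a search for one form matches the others, which is the point of stemming. The trade-off is that the indexed term is sometimes not a real word, and in corner cases the stem is not accurate.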

    So the token transformation you are seeing is not truncation. It is the transformation done by the stemming algorithm.

    If you want accurate tokens here, a good option is the hunspell token filter. It is dictionary-based, so its stems are real words, but the dictionary lookups make analysis slower than with an algorithmic stemmer.
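
    As a rough sketch of the hunspell suggestion, the index settings could look something like the following. The names en_us_hunspell and my_english_hunspell are made up for illustration, and the example assumes an en_US dictionary (the .aff and .dic files) has already been placed under config/hunspell/en_US on every node. Treat it as a starting point rather than a tested configuration for your version.

    ```json
    {
       "settings": {
          "analysis": {
             "filter": {
                "en_us_hunspell": {
                   "type": "hunspell",
                   "locale": "en_US",
                   "dedup": true
                }
             },
             "analyzer": {
                "my_english_hunspell": {
                   "type": "custom",
                   "tokenizer": "standard",
                   "filter": ["lowercase", "en_us_hunspell"]
                }
             }
          }
       }
    }
    ```

    The text field in the mapping would then reference "analyzer": "my_english_hunspell" instead of my_english. Since the analyzer changes how terms are produced, the documents need to be reindexed before the term vectors reflect it.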