I'm new to Elasticsearch, and all I did was index some documents. On retrieving the term vectors, I noticed that quite a few terms are truncated. Here is a small example:
"nationallypublic": {
"term_freq": 1,
"tokens": [
{
"position": 496,
"start_offset": 3126,
"end_offset": 3146
}
]
},
"natur": {
"term_freq": 1,
"tokens": [
{
"position": 60,
"start_offset": 373,
"end_offset": 380
}
]
},
These are some excerpts from the document. This one contains "naturally":
are some of the filmmakers ofthe 80s Its natural said Robert Friedman the senior vicepresident of worldwide advertising and publicity at Warner Bros
and another for "nationallypublic" (I know this is not a real word, but even so it should appear in full), which should have been "nationallypublicized":
They were reported missing on June 21 several hours after beingstopped for speeding near Philadelphia Miss After a nationallypublicized search their bodies were discovered Aug 4 on a farmjust outside the town
Did I do something wrong? Here are my settings and mappings:
{
    "ap1": {
        "mappings": {
            "document": {
                "properties": {
                    "docno": {
                        "type": "string",
                        "index": "not_analyzed",
                        "store": true
                    },
                    "text": {
                        "type": "string",
                        "store": true,
                        "term_vector": "with_positions_offsets_payloads",
                        "analyzer": "my_english"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "creation_date": "1422144472984",
                "uuid": "QzT_sx4aRWOXGlEs2ATibw",
                "analysis": {
                    "analyzer": {
                        "my_english": {
                            "type": "english",
                            "stopwords": "_none_"
                        }
                    }
                },
                "store": {
                    "type": "default"
                },
                "number_of_replicas": "0",
                "number_of_shards": "1",
                "version": {
                    "created": "1040299"
                }
            }
        }
    }
}
This is the effect of the stemmer. The english analyzer you configured as my_english includes a Snowball-based English stemmer token filter by default. The expected behaviour of a stemmer is to reduce words to their base form, like below:

jumping => jump
running => run
And so on. The Snowball stemmer is algorithmic: it strips suffixes by rule rather than by dictionary lookup, so the stem it produces may not be an actual English word; it only has to be the same stem for every inflected form of the word. Because the same transformation is applied at both index time and search time, the different forms still match each other. That is exactly what happened to your tokens:

natural => natur
naturally => natur
nationallypublicized => nationallypublic

So the stemming itself succeeds, but the stems stored in the index (and shown in the term vectors) can look like truncated words.
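You can see the effect of query-time stemming directly: searching for one inflected form matches documents that contain another. A quick illustration, assuming Elasticsearch is on localhost:9200 and the index and type names from your mapping:

curl 'localhost:9200/ap1/document/_search?q=text:naturally&pretty'

This should return your document even though the text contains "natural", because both the query term and the indexed term are stemmed to natur.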
The token transformation you are seeing is therefore not truncation but the output of the Snowball stemming algorithm. You can even confirm this from your term vectors: for "nationallypublic", start_offset 3126 and end_offset 3146 span exactly the 20 characters of "nationallypublicized" in the source text, because the offsets always refer to the original surface form while the term is the stem.
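You can also check what any analyzer emits with the Analyze API. A minimal sketch, again assuming Elasticsearch on localhost:9200 and that the index is still called ap1:

curl 'localhost:9200/ap1/_analyze?analyzer=my_english&text=natural+naturally+nationallypublicized&pretty'

The returned tokens should be the stems (natur, natur, nationallypublic), which you can compare against the terms in your term vectors.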
If you want the tokens to be real words, a good alternative is the hunspell token filter, which is dictionary-based; the trade-off is that dictionary lookups make analysis, and hence indexing and search, slower.
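A minimal sketch of such a setup, assuming you have placed the en_US.aff and en_US.dic dictionary files under config/hunspell/en_US/ on every node (the filter and analyzer names here are just examples):

{
    "settings": {
        "analysis": {
            "filter": {
                "en_hunspell": {
                    "type": "hunspell",
                    "locale": "en_US"
                }
            },
            "analyzer": {
                "my_english_hunspell": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "en_hunspell"]
                }
            }
        }
    }
}

You would then set "analyzer": "my_english_hunspell" on the text field instead of my_english. Note that the analyzer of an existing field cannot be changed in place, so this requires creating a new index and reindexing.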