elasticsearch lucene analyzer

Elasticsearch Token Position change


Recently I have taken an interest in Elasticsearch analyzers. I understand what a token graph, start_offset, end_offset, position, and positionLength are.

Index schema

PUT synonym_graph_index
{
  "settings": {
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "synonym_graph_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym_graph",
          "synonyms": ["wi fi => wifi,hotspot,fast network"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "synonym_graph_analyzer"
      }
    }
  }
}

I then ran the analyzer on a sample text with the _analyze API.

POST synonym_graph_index/_analyze
{
  "analyzer": "synonym_graph_analyzer",
  "text": "Airtel wi fi is up and down"
}

Result of analysis

{
  "tokens" : [
    {
      "token" : "Airtel",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wifi",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "hotspot",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "fast",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "network",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "up",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "and",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "down",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 6
    }
  ]
}

To understand this better, I made a table.

Using that table, I also drew the graph.

The network token has changed its position. Did this happen because I used the standard tokenizer, which split "fast network"? One more thing I would like to know: why is positionLength not reported for some tokens?


Solution

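    1. Yes. Synonyms just "replace" the input with the outputs; they don't change how text is processed downstream (tokenization, stemming, etc.). The multi-word output "fast network" is therefore still split into two tokens, so "network" lands at position 2.
    2. Your original string "wi fi" has 2 tokens, but some of the synonyms ("wifi", "hotspot") are single words, so they carry positionLength 2 to indicate that the token occupies 2 positions. When positionLength is missing from the _analyze output, the token has the default length of 1.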
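One way to check both points is to rebuild the token graph from the _analyze output and list every path through it. The following is only a minimal sketch: the token list is copied by hand from the response in the question, the helper names (build_arcs, enumerate_paths) are made up for illustration, and a missing positionLength is assumed to mean 1.

```python
from collections import defaultdict

# (token, position, positionLength) copied from the _analyze response above.
# None marks tokens where the response did not report positionLength.
TOKENS = [
    ("Airtel", 0, None),
    ("wifi", 1, 2),
    ("hotspot", 1, 2),
    ("fast", 1, None),
    ("network", 2, None),
    ("is", 3, None),
    ("up", 4, None),
    ("and", 5, None),
    ("down", 6, None),
]


def build_arcs(tokens):
    """Map each start position to a list of (token, end position) arcs."""
    arcs = defaultdict(list)
    for token, position, position_length in tokens:
        # Assumption: an unreported positionLength means the token spans 1 position.
        length = position_length if position_length is not None else 1
        arcs[position].append((token, position + length))
    return arcs


def enumerate_paths(arcs, start, end):
    """Return every token sequence that leads from start to end."""
    if start == end:
        return [[]]
    paths = []
    for token, next_position in arcs.get(start, []):
        for rest in enumerate_paths(arcs, next_position, end):
            paths.append([token] + rest)
    return paths


arcs = build_arcs(TOKENS)
final_position = max(end for outgoing in arcs.values() for _, end in outgoing)
paths = [" ".join(p) for p in enumerate_paths(arcs, 0, final_position)]
for path in paths:
    print(path)
```

With the tokens from the question this prints three sentences: one through "wifi", one through "hotspot", and one through "fast network". The single-word synonyms jump from position 1 straight to position 3, while "fast" and "network" each advance by one position, which is why "network" sits at position 2.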