
Elasticsearch tokenization for international languages


I wanted to find out how Elasticsearch tokenizes languages other than English, so I tried out the analyze API it provides. But I cannot understand the output at all. Take, for example:

GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "

Now, the above text contains 6 words in total, so I expect at most 6 tokens (assuming the text contains no stop words), but the output is something like this:

 {
   "tokens": [
      {
         "token": "2350",
         "start_offset": 3,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "2375",
         "start_offset": 10,
         "end_offset": 14,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "2306",
         "start_offset": 17,
         "end_offset": 21,
         "type": "<NUM>",
         "position": 3
      },
      {
         "token": "2325",
         "start_offset": 25,
         "end_offset": 29,
         "type": "<NUM>",
         "position": 4
      },
      {
         "token": "2361",
         "start_offset": 32,
         "end_offset": 36,
         "type": "<NUM>",
         "position": 5
      },
      {
         "token": "2340",
         "start_offset": 39,
         "end_offset": 43,
         "type": "<NUM>",
         "position": 6
      },
      {
         "token": "2366",
         "start_offset": 46,
         "end_offset": 50,
         "type": "<NUM>",
         "position": 7
      },
      {
         "token": "2361",
         "start_offset": 54,
         "end_offset": 58,
         "type": "<NUM>",
         "position": 8
      },
      {
         "token": "2370",
         "start_offset": 61,
         "end_offset": 65,
         "type": "<NUM>",
         "position": 9
      },
      {
         "token": "2305",
         "start_offset": 68,
         "end_offset": 72,
         "type": "<NUM>",
         "position": 10
      },
      {
         "token": "2324",
         "start_offset": 76,
         "end_offset": 80,
         "type": "<NUM>",
         "position": 11
      },
      {
         "token": "2352",
         "start_offset": 83,
         "end_offset": 87,
         "type": "<NUM>",
         "position": 12
      },
      {
         "token": "2340",
         "start_offset": 91,
         "end_offset": 95,
         "type": "<NUM>",
         "position": 13
      },
      {
         "token": "2369",
         "start_offset": 98,
         "end_offset": 102,
         "type": "<NUM>",
         "position": 14
      },
      {
         "token": "2350",
         "start_offset": 105,
         "end_offset": 109,
         "type": "<NUM>",
         "position": 15
      },
      {
         "token": "2360",
         "start_offset": 113,
         "end_offset": 117,
         "type": "<NUM>",
         "position": 16
      },
      {
         "token": "2369",
         "start_offset": 120,
         "end_offset": 124,
         "type": "<NUM>",
         "position": 17
      },
      {
         "token": "2344",
         "start_offset": 127,
         "end_offset": 131,
         "type": "<NUM>",
         "position": 18
      },
      {
         "token": "2344",
         "start_offset": 134,
         "end_offset": 138,
         "type": "<NUM>",
         "position": 19
      },
      {
         "token": "2366",
         "start_offset": 141,
         "end_offset": 145,
         "type": "<NUM>",
         "position": 20
      }
   ]
}

That means instead of six, Elasticsearch has detected around 20 tokens, all of type NUM (I don't know what that is). I am really confused about why this is happening. Can someone enlighten me as to what is going on? What am I doing wrong, or where is my understanding lacking?


Solution

  • How are you calling the Elasticsearch API? Possibly the Hindi characters are getting mangled by your client.

    It works okay for me (at least the Hindi characters appear in the result) on Linux with curl:

    curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
    {
      "tokens" : [ {
        "token" : "कह",
        "start_offset" : 4,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 2
      }, {
        "token" : "हुं",
        "start_offset" : 9,
        "end_offset" : 12,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "तुम",
        "start_offset" : 16,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 5
      }, {
        "token" : "सुन",
        "start_offset" : 20,
        "end_offset" : 25,
        "type" : "<ALPHANUM>",
        "position" : 6
      } ]
    }
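A clue worth checking (this is my inference, not something stated in the question): the `<NUM>` tokens in the question are exactly the decimal Unicode code points of the Devanagari characters in the input, e.g. म is U+092E = 2350 and े is U+0947 = 2375. That pattern suggests the client HTML-escaped the text into numeric character references like `&#2350;` before sending it, so Elasticsearch saw literal digit runs and tokenized them as numbers. A quick sketch to verify the mapping:

```python
# Compare the question's "<NUM>" tokens against the decimal Unicode
# code points of the Devanagari characters in the original text.
text = "में कहता हूँ और तुम सुनना"

# First few "<NUM>" tokens reported in the question, in order.
reported_tokens = ["2350", "2375", "2306", "2325", "2361"]

# Decimal code points of the non-space characters in the input.
code_points = [str(ord(ch)) for ch in text if not ch.isspace()]

# If these match, the client almost certainly sent HTML numeric
# character references (e.g. "&#2350;") instead of raw UTF-8.
print(code_points[: len(reported_tokens)])
print(code_points[: len(reported_tokens)] == reported_tokens)  # True
```

If that is what's happening, the fix is on the client side: send the body as raw UTF-8 (as the curl example above does) rather than HTML-encoding it.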