Tags: elasticsearch, search, tokenize, n-gram, elasticsearch-analyzers

Can't get proper results from Elasticsearch based on query and document tokenization


I'm trying to implement a search system that uses the edge n-gram tokenizer. The settings used to create the index are shown below. I use the same tokenizer for both documents and search queries. (The documents are in Persian.)

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "autocomplete"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

The problem is that I get 0 hits when searching for the term 'آلمانی', even though I have a document containing 'آلمان خوب است'.

As the output below shows, analyzing the term 'آلمانی' generates the token 'آلمان', so tokenization appears to work properly.
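
A listing like the one that follows can be requested through the `_analyze` API. This is only a sketch: the request that was actually run is not shown, so the choice of the `autocomplete_search` analyzer here is an assumption.

```console
GET /test/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "آلمانی"
}
```

With `min_gram` set to 2 as in the settings above, single-character tokens would not be expected, so the listing that follows may have come from a slightly different configuration.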

{
  "tokens" : [
    {
      "token" : "آ",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "آل",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "آلم",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "آلما",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "آلمان",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "آلمانی",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

The search query shown below returns 0 hits.

GET /test/_search
{
  "query": {"match": {
    "title": {"query": "آلمانی" , "operator": "and"}
  }}
}

However, searching for the term 'آلما' returns the document 'آلمان خوب است'. How can I fix this problem?

Your assistance would be greatly appreciated.


Solution

  • I found a DevTicks post by Ricardo Heck that solved my problem; see that post for a more detailed description.

    I changed my mapping settings like this:

        "mappings": {
          "_doc": {
            "properties": {
              "title": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "autocomplete_search",
                "fields": {
                  "ngram": {
                    "type": "text",
                    "analyzer": "autocomplete"
                  }
                }
              }
            }
          }
        }

    And now I get the document "آلمان خوب است" when searching for the term "آلمانی".
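
The query used after the mapping change is not shown. A minimal sketch, assuming the search targets the new `title.ngram` subfield and relies on the `match` query's default `or` operator (both are assumptions, not confirmed by the answer), might look like this:

```console
GET /test/_search
{
  "query": {
    "match": {
      "title.ngram": {
        "query": "آلمانی"
      }
    }
  }
}
```

With `or`, a document matches as soon as any n-gram of the query, such as 'آلمان', is present in the index. With the settings shown in the question, the original `and` query split 'آلمانی' into edge n-grams as well and required every one of them, including the full token 'آلمانی', which the document 'آلمان خوب است' never produces. That would also explain why the shorter term 'آلما' did match: all of its n-grams are prefixes of 'آلمان'.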