
Elasticsearch with phonetic search


I'm trying to get Elasticsearch to do a phonetic search over a list of cities. My goal is to find matching results even if the user misspells a name.

I've done the following steps:

  1. Remove the index

    curl -X DELETE "localhost:9200/city/"
    
  2. Create a new index

    curl -X PUT "localhost:9200/city/?pretty" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "my_analyzer": {
                "tokenizer": "standard",
                "filter": [
                  "lowercase",
                  "my_metaphone"
                ]
              }
            },
            "filter": {
              "my_metaphone": {
                "type": "phonetic",
                "encoder": "metaphone",
                "replace": true
              }
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }'
    
  3. Fill some sample data

    curl -X PUT "localhost:9200/city/_doc/1?pretty" -H 'Content-Type: application/json' -d'
    {
       "name":"Mayrhofen"
    }
    '
    
    curl -X PUT "localhost:9200/city/_doc/2?pretty" -H 'Content-Type: application/json' -d'
    {
       "name":"Ischgl"
    }
    '
    
    curl -X PUT "localhost:9200/city/_doc/3?pretty" -H 'Content-Type: application/json' -d'
    {
       "name":"Saalbach"
    }
    '
    
  4. Search the cities - here I get a result

    curl -X GET "localhost:9200/city/_search?pretty" -H 'Content-Type: application/json' -d'
    {
       "query":{
          "query_string":{
             "query":"Mayrhofen" 
          }
       }
    }
    '
    

I then tried the query with Mayerhofen and expected the same result as with Mayrhofen, but it does not match. The same issue occurs with Ischgl and Ichgl, and with Saalbach and Salbach.

Where's my error? Is something missing?


Solution

  • The problem is that you are using the wrong encoder: metaphone cannot match these inputs.

    For your inputs you need double_metaphone, which implements the Double Metaphone phonetic algorithm. I would suggest studying both your data and the algorithm to check whether a phonetic algorithm is really the best fit for your purpose.

    Analysis settings:

    {
      "analysis": {
        "analyzer": {
          "double_meta_true_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "true_doublemetaphone"
            ]
          }
        },
        "filter": {
          "true_doublemetaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "replace": true
          }
        }
      }
    }
    

    With this analyzer, the misspelled queries match the documents.
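Applied to the index from the question, the double_metaphone settings might look like the sketch below. It assumes Elasticsearch 7 or later on localhost:9200, the analysis-phonetic plugin (which provides the phonetic token filter) installed on the node, and the index and field names (city, name) from the question. The ES_URL variable and the recreate_city_index function are illustrative names, not part of the original post. Phonetic tokens are computed at index time, so the sample documents have to be indexed again after the index is recreated.

```shell
# Assumption: Elasticsearch 7+ reachable at localhost:9200 unless ES_URL is set.
ES_URL="${ES_URL:-localhost:9200}"

# Index definition from the question, with double_metaphone as the encoder.
CITY_INDEX_BODY='{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "double_meta_true_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "true_doublemetaphone"]
          }
        },
        "filter": {
          "true_doublemetaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "replace": true
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "double_meta_true_analyzer" }
    }
  }
}'

# Hypothetical helper: drop the old index (its tokens were built with
# metaphone) and create it again with the body above.
recreate_city_index() {
  curl -s -X DELETE "$ES_URL/city"
  curl -s -X PUT "$ES_URL/city?pretty" \
    -H 'Content-Type: application/json' -d "$CITY_INDEX_BODY"
}
```

Calling `recreate_city_index` and then repeating step 3 of the question should leave the three cities indexed with Double Metaphone tokens.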

    Why metaphone does not match. Analysing the correct spelling:

    GET http://localhost:9200/city2/_analyze
    {
       "field":"meta_true",
       "text":"Mayrhofen"
    }
    

    yields

    {
        "tokens": [
            {
                "token": "MRHF",
                "start_offset": 0,
                "end_offset": 9,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
    

    And analysing the misspelling

    {
       "field":"meta_true",
       "text":"Mayerhofen"
    }
    

    yields

    {
        "tokens": [
            {
                "token": "MYRH",
                "start_offset": 0,
                "end_offset": 10,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
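The two tokens differ (MRHF versus MYRH), so a query for one spelling never finds the term stored for the other. The same check can be run against the index from the question by naming the analyzer instead of a field. This is a minimal sketch: ES_URL, analyze_body and analyze_city are illustrative names, and the index name city and analyzer name my_analyzer are taken from the question.

```shell
# Assumption: Elasticsearch reachable at localhost:9200 unless ES_URL is set.
ES_URL="${ES_URL:-localhost:9200}"

# Build an _analyze request body: $1 = analyzer name, $2 = text to analyse.
# Note: no JSON escaping is done, so keep the text free of quotes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Hypothetical helper: run an analyzer defined on the "city" index
# against a piece of text and print the resulting tokens.
analyze_city() {
  curl -s -X GET "$ES_URL/city/_analyze?pretty" \
    -H 'Content-Type: application/json' -d "$(analyze_body "$1" "$2")"
}
```

Running `analyze_city my_analyzer Mayrhofen` and `analyze_city my_analyzer Mayerhofen` side by side shows whether the two spellings end up as the same token; if they do not, the search cannot match.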
    

    Double Metaphone, by contrast, produces the same token for both spellings. Analysing

    GET http://localhost:9200/city2/_analyze
    {
       "field":"doublemeta_true",
       "text":"Mayerhofen"
    }
    

    and

    {
       "field":"doublemeta_true",
       "text":"Mayrhofen"
    }
    

    both yield the same token (the offsets shown are for Mayerhofen)

    {
        "tokens": [
            {
                "token": "MRFN",
                "start_offset": 0,
                "end_offset": 10,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
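Since both spellings now map to MRFN, a search for the misspelled name should find the document once the index uses the double_metaphone analyzer and the documents have been indexed again. The sketch below uses a match query on the name field rather than the query_string query from the question, purely to make the target field explicit. ES_URL, match_body and search_city are illustrative names, and the index and field names come from the question.

```shell
# Assumption: Elasticsearch reachable at localhost:9200 unless ES_URL is set.
ES_URL="${ES_URL:-localhost:9200}"

# Build a match query on the "name" field: $1 = search text.
# Note: no JSON escaping is done, so keep the text free of quotes.
match_body() {
  printf '{"query":{"match":{"name":"%s"}}}' "$1"
}

# Hypothetical helper: search the "city" index for a (possibly misspelled) name.
search_city() {
  curl -s -X GET "$ES_URL/city/_search?pretty" \
    -H 'Content-Type: application/json' -d "$(match_body "$1")"
}
```

For example, `search_city Mayerhofen` sends the misspelled name through the field's analyzer at query time, so it is compared against the stored phonetic token rather than the literal text.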