
How to decide which Encoder to use for which language in Elasticsearch "Phonetic Token filter"?


I have used the Metaphone and Soundex encoders with the "Phonetic Token Filter" in Elasticsearch.

Metaphone is good for English words.

Soundex is good for English as well as Hindi, and maybe for many other languages too.

I want to know which of these encoders works best for Hindi and, if possible, for other Indian languages:

  • Soundex
  • Metaphone
  • double_metaphone
  • refined_soundex
  • caverphone1 - English (New Zealand localised)
  • caverphone2 - English (New Zealand localised)
  • cologne - German
  • nysiis - Improved Soundex
  • koelnerphonetik - German
  • haasephonetik - German
  • beider_morse - English and multiple European languages
  • daitch_mokotoff - Slavic & Yiddish surnames

The Elasticsearch website does not list which encoder to choose for which language.

Also, please tell me which of these encoders you have already used, and for which languages.


Solution

  • Phonetic encoders are algorithms for indexing words by their pronunciation.

    An explanation of the main ones is available on Wikipedia:

    1. Metaphone, Double Metaphone, and Metaphone 3: suitable for use with most English words, not just names. Metaphone algorithms are the basis for many popular spell checkers. Double Metaphone is the second generation of the algorithm.
    2. Soundex: developed to encode surnames for use in censuses. Soundex codes are four-character strings composed of a single letter followed by three digits.
    3. Daitch–Mokotoff Soundex: a refinement of Soundex designed to better match surnames of Slavic and Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six numeric digits.
    4. Cologne phonetics: similar to Soundex, but more suitable for German words.
    5. New York State Identification and Intelligence System (NYSIIS): maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding.
    6. Match Rating Approach: developed by Western Airlines in 1977. This algorithm has an encoding technique and a range comparison technique.
    7. Caverphone: created to assist in data matching between late 19th century and early 20th century electoral rolls, and optimized for accents present in parts of New Zealand.
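
To make the "one letter followed by three digits" description of Soundex concrete, here is a minimal Python sketch of classic American Soundex. It is an illustration only, not the code Elasticsearch runs, and the function name `soundex` is just a label chosen for this example. It handles ASCII Latin letters only, so Hindi text written in Devanagari would first need to be transliterated or handled by an Indic variant of the algorithm.

```python
# Consonant groups used by classic American Soundex.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no ASCII letters."""
    # Keep ASCII letters only; anything else (digits, Devanagari, ...) is dropped.
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    code = letters[0]                          # the first letter is kept as-is
    prev = SOUNDEX_CODES.get(letters[0], "")   # code of the previous letter

    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels (and Y) reset the run, so a repeated code counts again

    # Pad with zeros and cut to one letter plus three digits.
    return (code + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Names that sound alike, such as "Robert" and "Rupert", end up with the same code, which is what lets a phonetic filter match them at search time.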

    References: details of the above algorithms and their subtypes are available on the Wikipedia page below.
    1. https://en.wikipedia.org/wiki/Phonetic_algorithm

    Among the above, Soundex is the most suitable for Indian languages. You can check the resources below:
    1. Phonetic search for Indian languages
    2. https://thottingal.in/blog/2009/07/26/indicsoundex/
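
If you go with Soundex, the index settings could look roughly like the sketch below. It builds the settings as a Python dictionary and prints the JSON body you would send when creating the index. It assumes the `analysis-phonetic` plugin is installed; the names `soundex_filter` and `phonetic_analyzer` are placeholders made up for this example, and you may well want a different tokenizer or extra filters for Hindi text.

```python
import json

# Hypothetical filter/analyzer names; only "type", "encoder" and "replace"
# are options of the phonetic token filter itself.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "soundex_filter": {
                    "type": "phonetic",
                    "encoder": "soundex",
                    # Keep the original token next to its phonetic code.
                    "replace": False,
                }
            },
            "analyzer": {
                "phonetic_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "soundex_filter"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    # JSON body for the index-creation request.
    print(json.dumps(index_settings, indent=2))
```

Switching to another encoder from the list in the question should only require changing the `encoder` value, which makes it easy to compare a few of them on your own Hindi data.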