unicode, match, soundex, similarity, phonetics

Compare short strings in different languages for similar sound: is Soundex the answer?


How could I get a sound-similarity "rating" between a string written in one language and a string written in another? That is, I am looking for an algorithm that will identify that

"David Letterman" and "דוד לטרמן" are strings that sound alike.

(By the way, the above is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)

The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions/translations of the strings.

I have already implemented a tweaked version of Soundex, which works so-so. Is this the way to go?


Solution

  • Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.

    You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.

    Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.

    • בוףפ → 1
    • גזחךכסקש → 2
    • דטת → 3
    • ץצ → 32
    • ל → 4
    • םמןנ → 5
    • ר → 6
    • אהיע → ignored
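As a minimal sketch, the table above can be turned into a lookup in Python. The names `HEBREW_GROUPS`, `HEBREW_CODES`, and `hebrew_soundex` are made up for this illustration. The sketch treats the first letter like every other letter, and it neither pads nor truncates the key, since nothing above requires a fixed length.

```python
# Hebrew letter groups -> Soundex digits, copied from the list above.
HEBREW_GROUPS = {
    "בוףפ": "1",
    "גזחךכסקש": "2",
    "דטת": "3",
    "ץצ": "32",
    "ל": "4",
    "םמןנ": "5",
    "ר": "6",
}

# Flatten the groups into a per-letter lookup table.
HEBREW_CODES = {
    letter: code
    for letters, code in HEBREW_GROUPS.items()
    for letter in letters
}


def hebrew_soundex(text):
    """Return the Soundex-style digit string for a Hebrew string.

    Characters without a code (the ignored letters, spaces, vowel points)
    contribute nothing to the result.
    """
    return "".join(HEBREW_CODES.get(ch, "") for ch in text)


print(hebrew_soundex("דוד לטרמן"))  # 31343655
```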

    However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:

    • ו is mapped to 1 (like English V) in the list above, but it often represents O, U, or W, in which case it should be ignored in Soundex.
    • ח is hard to classify due to its lack of an English equivalent. I put it in category 2 because this (1) matches the "ch" transliteration, and (2) allows ך/כ to have the same category with or without a dagesh.
    • Ashkenazi pronunciation would split ת between categories 2 and 3 (it is pronounced like S without a dagesh and like T with one).

    To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
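One way to generate the multiple keys is to give each ambiguous letter a tuple of possible codes and take the Cartesian product. This is only a sketch: `ALTERNATIVES` and `hebrew_soundex_keys` are hypothetical names, and only ו and ת are treated as ambiguous here.

```python
from itertools import product

# Digit -> Hebrew letters, as in the mapping list above.
BASE = {
    "1": "בוףפ",
    "2": "גזחךכסקש",
    "3": "דטת",
    "32": "ץצ",
    "4": "ל",
    "5": "םמןנ",
    "6": "ר",
}

# Every letter starts with exactly one possible code...
ALTERNATIVES = {
    letter: (code,)
    for code, letters in BASE.items()
    for letter in letters
}
# ...and the ambiguous letters get extra options.
ALTERNATIVES["ו"] = ("1", "")   # V, or O/U/W (ignored)
ALTERNATIVES["ת"] = ("3", "2")  # T, or Ashkenazi S


def hebrew_soundex_keys(text):
    """Return the set of all candidate keys for a Hebrew string."""
    options = [ALTERNATIVES[ch] for ch in text if ch in ALTERNATIVES]
    return {"".join(combo) for combo in product(*options)}


print(sorted(hebrew_soundex_keys("שבת")))  # ['212', '213']
```

The number of keys doubles with every ambiguous letter, so this is only practical for short strings such as names.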

    Similar mappings can be made for Greek:

    • ΒΠΦ → 1
    • Ψ → 12
    • ΓΖΚΞΣΧ → 2
    • ΔΘΤ → 3
    • Λ → 4
    • ΜΝ → 5
    • Ρ → 6
    • ΑΕΗΙΟΥΩ → ignored

    or Russian:

    • БВПФ → 1
    • ГЖЗКСХЧШЩ → 2
    • ДТ → 3
    • Ц → 32
    • Л → 4
    • МН → 5
    • Р → 6
    • АЕЁИЙОУЪЫЬЭЮЯ → ignored

    (Note that some of the 2's might be 32's, depending on your transliteration convention.)
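The per-script tables can be merged into a single lookup so that strings in any of these alphabets produce comparable keys. The sketch below adds the standard Soundex grouping for Latin letters. `SCRIPTS`, `CODES`, and `soundex_key` are hypothetical names. Collapsing runs of identical digits after the ignored letters are dropped is an assumption made here (standard Soundex only collapses letters that are directly adjacent), chosen because Hebrew usually leaves vowels unwritten.

```python
from itertools import groupby

# Letter groups -> digits for each script. The Latin row is the standard
# Soundex grouping; the other rows are copied from the lists above.
SCRIPTS = {
    "latin": {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
              "L": "4", "MN": "5", "R": "6"},
    "hebrew": {"בוףפ": "1", "גזחךכסקש": "2", "דטת": "3", "ץצ": "32",
               "ל": "4", "םמןנ": "5", "ר": "6"},
    "greek": {"ΒΠΦ": "1", "Ψ": "12", "ΓΖΚΞΣΧ": "2", "ΔΘΤ": "3",
              "Λ": "4", "ΜΝ": "5", "Ρ": "6"},
    "russian": {"БВПФ": "1", "ГЖЗКСХЧШЩ": "2", "ДТ": "3", "Ц": "32",
                "Л": "4", "МН": "5", "Р": "6"},
}

# One flat lookup table covering all four alphabets.
CODES = {
    letter: code
    for table in SCRIPTS.values()
    for letters, code in table.items()
    for letter in letters
}


def soundex_key(text, collapse=True):
    """Map a string in any supported script to a digit string.

    Unmapped characters (vowels, spaces, accents) are dropped. With
    collapse=True, runs of the same digit are reduced to one digit
    (an assumption of this sketch, see the note above).
    """
    digits = "".join(CODES.get(ch, "") for ch in text.upper())
    if collapse:
        digits = "".join(digit for digit, _ in groupby(digits))
    return digits


print(soundex_key("David Letterman"))  # 3134365
print(soundex_key("דוד לטרמן"))        # 3134365
```

Without collapsing, the two keys are "313433655" and "31343655": close, but not equal, because the doubled T is written once in Hebrew.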


    A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.

    For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
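A minimal sketch of that score, using the textbook dynamic-programming computation of longest common subsequence length. The function names `lcs_len` and `similarity` are chosen for illustration, and returning 1.0 for two empty keys is an assumption, since the formula is undefined in that case.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score in [0, 1]: 2 * lcslen(a, b) / (len(a) + len(b))."""
    if not a and not b:
        return 1.0  # assumption: two empty keys count as identical
    return 2 * lcs_len(a, b) / (len(a) + len(b))


# Uncollapsed keys for "David Letterman" in English and in Hebrew.
print(round(similarity("313433655", "31343655"), 3))  # 0.941
print(round(similarity("212", "213"), 3))             # 0.667
```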