unicode match soundex similarity phonetics

Compare short strings in different languages for similar sound: is Soundex the answer?

How can I get a sound-similarity "rating" between a string written in one language and a string written in another? That is, I want an algorithm that will identify that

"David Letterman" and "דוד לטרמן" are strings that sound alike.

(By the way, the above is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)

The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions/translations of the strings.

I have already implemented a tweaked version of Soundex, which works so-so. Is this the way to go?

Solution

Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.

You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.
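As a rough sketch of what "treating the first letter like all the others" could look like for English, the function below maps every letter to its usual Soundex digit and never keeps the leading letter. The name english_key is made up for illustration. Two simplifications are my own assumptions rather than classic Soundex: repeated codes are collapsed even when an ignored letter sits between them, and the key is not truncated or padded to a fixed length.

```python
# Usual English Soundex letter groups; letters not listed (A E I O U Y H W) are ignored.
ENGLISH_GROUPS = {
    "BFPV": "1",
    "CGJKQSXZ": "2",
    "DT": "3",
    "L": "4",
    "MN": "5",
    "R": "6",
}
ENGLISH_CODES = {ch: code for letters, code in ENGLISH_GROUPS.items() for ch in letters}


def english_key(text):
    """Soundex-style key in which the first letter is coded like any other.

    Assumption: a code equal to the previously emitted one is dropped, even if
    ignored letters separate the two (a simplification of classic Soundex).
    """
    digits = []
    for ch in text.upper():
        code = ENGLISH_CODES.get(ch)
        if code is not None and (not digits or digits[-1] != code):
            digits.append(code)
    return "".join(digits)


print(english_key("David Letterman"))  # 3134365
```

Collapsing across ignored letters is a deliberate choice here: Hebrew usually leaves vowels unwritten, so a rule that depends on where the vowels are would behave differently on the two sides.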

Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.

בוףפ → 1
גזחךכסקש → 2
דטת → 3
ץצ → 32
ל → 4
םמןנ → 5
ר → 6
אהיע → ignored
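
The table above can be turned into a lookup fairly directly. This is only a sketch: hebrew_key is a hypothetical name, anything not in the table (including spaces and the ignored letters) is skipped, and collapsing repeated digits is my assumption about how you would want duplicates handled.

```python
# Hebrew letter groups and their Soundex digits, copied from the table above.
HEBREW_GROUPS = {
    "בוףפ": "1",
    "גזחךכסקש": "2",
    "דטת": "3",
    "ץצ": "32",
    "ל": "4",
    "םמןנ": "5",
    "ר": "6",
}
HEBREW_CODES = {ch: code for letters, code in HEBREW_GROUPS.items() for ch in letters}


def hebrew_key(text):
    """Map Hebrew text straight to Soundex digits, with no transliteration step.

    Assumption: a digit equal to the previously emitted one is dropped.
    """
    digits = []
    for ch in text:
        # A letter may contribute two digits (tsade -> "32"), so loop over them.
        for digit in HEBREW_CODES.get(ch, ""):
            if not digits or digits[-1] != digit:
                digits.append(digit)
    return "".join(digits)


print(hebrew_key("דוד לטרמן"))  # 3134365
```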

However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:

ו is mapped to 1 (like English V) in the list above, but it often represents O, U, or W, in which case it should be ignored in Soundex.
ח is hard to classify due to its lack of an English equivalent. I put it in category 2 because this (1) matches the "ch" transliteration, and (2) allows ך/כ to have the same category with or without a dagesh.
Ashkenazi pronunciation would split ת between categories 2 and 3.

To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
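
One way to sketch the multiple-key idea is to give each ambiguous letter a list of candidate codes and take every combination. The table below is a tiny hypothetical excerpt (only the letters needed for the example, plus ו), and all_keys is a made-up name; an empty string stands for "ignore this letter".

```python
from itertools import product

# Hypothetical excerpt: each letter lists every code it may take.
ALTERNATIVES = {
    "ש": ["2"],
    "ב": ["1"],
    "ת": ["3", "2"],  # "t"-like reading vs. Ashkenazi "s"-like reading
    "ו": ["1", ""],   # consonant V vs. vowel O/U (ignored)
}


def all_keys(text, alternatives=ALTERNATIVES):
    """Return the set of every key obtainable from the per-letter alternatives."""
    options = [alternatives[ch] for ch in text if ch in alternatives]
    return {"".join(combo) for combo in product(*options)}


print(sorted(all_keys("שבת")))  # ['212', '213']
```

The number of keys doubles with each two-way ambiguous letter, which should be fine for short strings such as names but is worth keeping in mind for longer input.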

Similar mappings can be made for Greek:

ΒΠΦ → 1
Ψ → 12
ΓΖΚΞΣΧ → 2
ΔΘΤ → 3
Λ → 4
ΜΝ → 5
Ρ → 6
ΑΕΗΙΟΥΩ → ignored

or Russian:

БВПФ → 1
ГЖЗКСХЧШЩ → 2
ДТ → 3
Ц → 32
Л → 4
МН → 5
Р → 6
АЕЁИЙОУЪЫЬЭЮЯ → ignored

(Note that some of the 2's might be 32's, depending on your transliteration convention.)

A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.

For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
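
A minimal sketch of that score, using a standard dynamic-programming longest common subsequence. The function names lcs_len and similarity are illustrative, and returning 1.0 for two empty keys is my own convention to avoid dividing by zero.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """2*lcslen(A, B)/(len(A)+len(B)), a score between 0 and 1."""
    if not a and not b:
        return 1.0  # assumption: two empty keys count as identical
    return 2 * lcs_len(a, b) / (len(a) + len(b))


print(similarity("3134365", "3134365"))  # 1.0
print(similarity("123", "456"))          # 0.0
```

If you generate several keys per string, one option is to take the best score over all pairs of keys.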