How can I get a sound-similarity "rating" between a string written in one language and a string written in another? That is, I want an algorithm that will identify that
"David Letterman" and "דוד לטרמן" are strings that sound alike.
(By the way, the above is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)
The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions or translations of the strings.
I have already implemented a tweaked variant of Soundex, which works so-so. Is this the way to go?
Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.
You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.
Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.
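As a rough sketch of what I mean by mapping directly (in Python; the Hebrew table here is my own guess at reasonable codes, not any standard, and the names `sound_key`, `LATIN_CODES` and `HEBREW_CODES` are made up for illustration), something like this treats every letter the same way, drops letters that have no code, and collapses repeated digits. Ambiguous letters such as ו and ת are given just one reading for now, and there is no padding or truncation to four characters:

```python
def _build(groups):
    """Expand [(letters, digit), ...] into a {letter: digit} dict."""
    return {ch: digit for letters, digit in groups for ch in letters}

# Standard English Soundex digit groups, applied to EVERY letter
# (the first letter gets no special treatment).
LATIN_CODES = _build([
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
])

# ASSUMPTION: an illustrative Hebrew -> Soundex table, not a standard one.
# Letters that are usually silent or vowel-like (א ה י ע) are simply omitted.
# ו is read as "v", ח as "ch", ת as "t"; other readings are ignored here.
HEBREW_CODES = _build([
    ("בופף", "1"), ("גזחכךסצץקש", "2"), ("דטת", "3"),
    ("ל", "4"), ("מםנן", "5"), ("ר", "6"),
])

def sound_key(text):
    """Return a Soundex-like digit string for Latin or Hebrew text."""
    digits = []
    for ch in text.upper():  # upper() leaves Hebrew letters unchanged
        code = LATIN_CODES.get(ch) or HEBREW_CODES.get(ch)
        # Skip uncoded characters and collapse runs of the same digit.
        if code and (not digits or digits[-1] != code):
            digits.append(code)
    return "".join(digits)

print(sound_key("David Letterman"))
print(sound_key("דוד לטרמן"))
```

With this particular table both calls print the same key, `3134365`. Collapsing duplicates after the vowels are dropped (rather than before, as classic Soundex does) is a deliberate choice here, since Hebrew spelling usually leaves the vowels out anyway.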
However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:
To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
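One way to generate the multiple keys is to give each ambiguous letter a tuple of alternative codes and take the Cartesian product. This is only a sketch: apart from the two readings of ת implied by the "שבת" example, the alternatives listed (ו as "v" or as a vowel, ח as "ch" or silent) are my own assumptions, and `hebrew_sound_keys` is a hypothetical name.

```python
from itertools import product

# ASSUMPTION: illustrative table; each letter maps to its possible codes.
# "" means "treated as silent or as a vowel in this reading".
_GROUPS = [
    ("בפף", ("1",)),
    ("גזכךסצץקש", ("2",)),
    ("דט", ("3",)),
    ("ל", ("4",)),
    ("מםנן", ("5",)),
    ("ר", ("6",)),
    ("ת", ("3", "2")),   # "t" or "s", as in the "שבת" example
    ("ו", ("1", "")),    # consonant "v" or a vowel (assumed)
    ("ח", ("2", "")),    # "ch" or effectively silent (assumed)
]
HEBREW_ALTERNATIVES = {
    ch: codes for letters, codes in _GROUPS for ch in letters
}

def hebrew_sound_keys(text):
    """Return the set of all Soundex-like keys for a Hebrew string."""
    options = [HEBREW_ALTERNATIVES[ch] for ch in text
               if ch in HEBREW_ALTERNATIVES]
    keys = set()
    for combo in product(*options):
        digits = []
        for code in combo:
            # Drop empty codes and collapse runs of the same digit.
            if code and (not digits or digits[-1] != code):
                digits.append(code)
        keys.add("".join(digits))
    return keys

print(sorted(hebrew_sound_keys("שבת")))  # ['212', '213']
```

The number of keys grows exponentially with the number of ambiguous letters, so for long strings you would probably want to cap it or generate keys per word.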
Similar mappings can be made for Greek:
or Russian:
(Note that some of the 2's might be 32's, depending on your transliteration convention.)
A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.
For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
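A minimal sketch of that formula in Python, using a textbook dynamic-programming LCS. The inputs are assumed to be Soundex keys that have already been computed; returning 1.0 for two empty keys is an arbitrary choice of mine to avoid dividing by zero.

```python
def lcslen(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """2*lcslen(A, B)/(len(A)+len(B)), a score between 0 and 1."""
    if not a and not b:
        return 1.0  # ASSUMPTION: two empty keys count as identical
    return 2 * lcslen(a, b) / (len(a) + len(b))

print(similarity("212", "213"))  # LCS is "21", so 2*2/6
```

If a string has several candidate keys, one reasonable option is to take the maximum similarity over all pairs of keys.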