Damerau–Levenshtein distance for language specific quirks


Dutch speakers treat the two characters "ij" as a single letter, one that is easily interchanged with "y".

For a project I'm working on I would like to have a variant of the Damerau–Levenshtein distance that calculates the distance between "ij" and "y" as 1 instead of the current value of 2.

I have tried to implement this myself without success. My problem is that I do not know how to handle the fact that the two texts have different lengths. Does anyone have a suggestion or a code fragment showing how to solve this?

Thanks.


Solution

  • The Wikipedia article is rather loose with terminology. There are no such things as "strings" in "natural language". There are phonemes in natural language which can be represented by written characters and character-combinations.

    Some character-combinations are vestiges of historical conventions which have survived into modern times, as in modern English "rough" where the "gh" can sound like -f- or make no sound at all. It seems to me that in focusing on raw "strings" the algorithm must be agnostic about the historical relationship of language and orthographic convention, which leads to some arbitrary metrics whenever character-combinations correlate to a single phoneme. How would it measure "rough" to "ruf"? Or "through" to "thru"? Or German o-umlaut to "oe"?

    In your case the -y- can be exchanged phonetically and orthographically with -ij-. So what is that according to the algorithm: two deletions followed by an insertion, or a single deletion of the -j- or of the -i- followed by a substitution of -y- for the remaining character? Or is -ij- first coalesced into one unit, with the coalescence followed by a substitution?

    I would recommend that you replace -ij- with a single, otherwise unused character before applying the algorithm, perhaps U+00EC, Latin small letter i with grave accent.

    How does the algorithm handle multi-codepoint characters?
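
The replacement recommended above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: the names IJ_PLACEHOLDER, normalize_ij, osa_distance and dutch_distance are made up for this example, the placeholder is assumed never to occur in the real input, and the distance shown is the restricted "optimal string alignment" variant of Damerau–Levenshtein, which may not be the variant your project uses.

```python
# Assumption: this character never occurs in the texts being compared.
IJ_PLACEHOLDER = "\u00ec"  # U+00EC, Latin small letter i with grave accent


def normalize_ij(text: str) -> str:
    """Collapse every lowercase "ij" pair into one placeholder character."""
    return text.replace("ij", IJ_PLACEHOLDER)


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def dutch_distance(a: str, b: str) -> int:
    """Distance computed after "ij" has been collapsed to one character."""
    return osa_distance(normalize_ij(a), normalize_ij(b))


print(osa_distance("ij", "y"))    # plain distance
print(dutch_distance("ij", "y"))  # distance after normalization
```

Because both inputs are normalized before the comparison, the different lengths stop being a problem: "ij" becomes one character, and replacing it with "y" is an ordinary substitution. One side effect to be aware of is that every adjacent "i" + "j" is collapsed, including places where the two letters do not form the Dutch digraph, and that uppercase "IJ" would need its own rule.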