string, levenshtein-distance

Levenshtein distance for non-English languages


Which languages besides English does Levenshtein distance support?

I understand that the language needs a single-character-based representation (rather than two or more characters being treated as a single entity, as in Dutch?), and I would like to know which languages do and do not fall into this category.

Thanks Abhishek S


Solution

  • Levenshtein distance is defined on arbitrary strings in the mathematical sense, that is, on sequences of symbols from any alphabet, so it is not language-specific. Just make sure you compute it at the right level of representation; a sensible default is Unicode code points after normalization (for example NFC). If the language you're handling always needs two symbols to represent anything meaningful, compute the distance on pairs of symbols instead.

    [I'm not sure what you mean by multiple characters being "a single entity" in Dutch, but if you mean the ij ligature, that has never stopped me from applying Levenshtein to Dutch text :)]
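
To make the "right level of representation" point concrete, here is a minimal Python sketch. It assumes NFC as the normalization form, and the names `levenshtein`, `code_points` and `dutch_symbols` are made up for this illustration, not taken from any library. The distance function works on any sequence of symbols, so the only language-specific part is how you split the text into symbols.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences of comparable symbols."""
    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def code_points(text):
    """Default symbol level: Unicode code points after NFC normalization."""
    return list(unicodedata.normalize("NFC", text))


def dutch_symbols(text):
    """Hypothetical tokenizer that treats the digraph 'ij' as one symbol."""
    text = unicodedata.normalize("NFC", text)
    symbols = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() == "ij":
            symbols.append(text[i:i + 2])
            i += 2
        else:
            symbols.append(text[i])
            i += 1
    return symbols


# Precomposed "é" versus "e" followed by a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(levenshtein(composed, decomposed))                            # 2
print(levenshtein(code_points(composed), code_points(decomposed)))  # 0

# Same pair of words, two different symbol levels.
print(levenshtein(code_points("wijn"), code_points("wan")))         # 2
print(levenshtein(dutch_symbols("wijn"), dutch_symbols("wan")))     # 1
```

Whether "ij" should count as one symbol or two depends on what you want the distance to mean for your data; the algorithm itself works either way.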