I have to normalize the Levenshtein distance to a value between 0 and 1. I see different variations floating around on SO.
I am thinking of adopting the following approach:
Then the highest score 1.0 means an exact match and 0.0 means no match.
But I see variations here:
two whole texts similarity using levenshtein distance, which uses 1 - distance(a,b)/max(a.length, b.length)
Difference in normalization of Levenshtein (edit) distance?
Explanation of normalized edit distance formula
I am wondering whether there is a canonical implementation in Java. I know org.apache.commons.text
only implements LevenshteinDistance and not a normalized Levenshtein distance.
Your first answer begins with "The effects of both variants should be nearly the same". The reason a normalized LevenshteinDistance doesn't exist is that you (or somebody else) haven't seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:
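// Normalizes a precomputed Levenshtein distance by the length of the longer string.
// The result is in [0, 1]: 0.0 means identical strings, 1.0 means nothing in common.
private double normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    int maxLength = Math.max(s1.length(), s2.length());
    // Two empty strings are identical; this also avoids 0.0 / 0, which is NaN.
    if (maxLength == 0) {
        return 0.0;
    }
    return levenshtein / maxLength;
}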
After 3 days, once this has been thoroughly ripped to shreds, I'll add it as a GitHub issue on commons-text.
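Since the question wants 1.0 for an exact match and 0.0 for no match, here is a minimal self-contained sketch of the 1 - distance(a,b)/max(a.length, b.length) variant. The class name NormalizedLevenshtein and the methods distance and similarity are hypothetical names made up for this example and are not part of commons-text. The distance is computed with a plain two-row dynamic-programming loop so the snippet runs without dependencies; if you already use commons-text, you could presumably swap in LevenshteinDistance.getDefaultInstance().apply(a, b) instead.

```java
// Hypothetical helper class, not part of any library.
final class NormalizedLevenshtein {

    private NormalizedLevenshtein() {}

    // Plain Levenshtein distance using two rolling rows of the DP table.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 is an exact match, 0.0 means nothing in common.
    static double similarity(CharSequence a, CharSequence b) {
        int maxLength = Math.max(a.length(), b.length());
        // Assumption: two empty strings count as an exact match.
        if (maxLength == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / maxLength;
    }
}
```

For example, NormalizedLevenshtein.similarity("kitten", "sitting") is 1 - 3/7, since the distance is 3 and the longer string has 7 characters. Dividing by the longer length keeps the result in [0, 1] because the distance can never exceed that length.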