I am using both Daitch-Mokotoff soundexing and Damerau-Levenshtein to find out if a user entry and a value in the application are "the same".
Is Levenshtein distance supposed to be used as an absolute value? If I have a 20 letter word, a distance of 4 is not so bad. If the word has 4 letters...
What I am now doing is taking the distance / length to get a distance that better reflects what percentage of the word has been changed.
Is that a valid/proven approach? Or is it plain stupid?
It seems like it would depend on your requirements. (To clarify: Levenshtein distance is an absolute value, but as the OP pointed out, the raw value may not be as useful for a given application as a measure that takes the length of the word into account. This is because we are really more interested in similarity than in distance per se.)
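For illustration, here is a minimal sketch of the length-normalised idea from the question. It assumes the restricted form of Damerau-Levenshtein (optimal string alignment) and divides by the longer of the two lengths; both choices, and the function names, are mine rather than anything the OP stated.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    # d[i][j] = number of edits needed to turn a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer string's length (an assumed choice)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return osa_distance(a, b) / longest
```

Dividing by the longer length keeps the result between 0 and 1, so a single threshold can be applied to short and long words alike. Whether that threshold suits your data is something you would have to check against real entries.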
Sounds like you're trying to determine whether the user intended their entry to be the same as a given data value?
Are you doing spell-checking, or conforming invalid input to a known set of values? What are your priorities?
You might end up using the Levenshtein distance in one way to determine whether a word should be offered in a suggestion list, and in another way to determine how to order that list.
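As a rough sketch of that two-way use: the length-relative distance decides whether a candidate makes the list at all, and the raw distance decides the order. The `suggest` function, the `max_ratio` cut-off of 0.34, and the sample words are all hypothetical, and plain Levenshtein is used here to keep the example short.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def suggest(entry: str, known_values: list[str], max_ratio: float = 0.34) -> list[str]:
    """Filter by relative distance, then order by absolute distance."""
    scored = []
    for value in known_values:
        dist = levenshtein(entry.lower(), value.lower())
        ratio = dist / max(len(entry), len(value), 1)
        if ratio <= max_ratio:  # inclusion: length-relative test
            scored.append((dist, value))
    scored.sort()  # ordering: fewest raw edits first, ties alphabetical
    return [value for _, value in scored]


print(suggest("aple", ["apple", "maple", "grape", "apply"]))
# prints ['apple', 'maple']
```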
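It seems to me, if I've inferred your purpose correctly, that the core thing you want to measure is similarity rather than difference between two strings. As such, you could use Jaro or Jaro-Winkler distance (despite the name, a similarity score between 0 and 1, where 1 means the strings are identical), which take into account the length of the strings and the number of characters in common:
The Jaro distance dj of two given strings s1 and s2 is
dj = (m / |s1| + m / |s2| + (m - t) / m) / 3
(with dj = 0 when m = 0), where:
- m is the number of matching characters
- t is the number of transpositions, counted as half the number of matching characters that appear in a different order in the two strings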
Jaro–Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length l.
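To make the formula concrete, here is a small sketch of both measures. It assumes the usual textbook conventions that are not spelled out above: two characters only count as matching if they are equal and no more than max(|s1|, |s2|) / 2 - 1 positions apart, t is half the number of matched characters that are out of order, the prefix scale p defaults to 0.1, and the prefix length l is capped at 4. The function and parameter names are mine.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Equal characters only match if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that appear in a different order
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of length l <= max_prefix."""
    dj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_prefix:
            break
        l += 1
    return dj + l * p * (1 - dj)
```

With the classic pair "MARTHA" / "MARHTA" this gives m = 6 and t = 1, so the Jaro value is about 0.944, and the shared prefix "MAR" lifts the Jaro-Winkler value to about 0.961. For production use you would probably reach for an existing library rather than this sketch.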