I have a java list (it could be a map if it is necessary) with a lot of strings.
I want to find the most similar strings for another given string in an efficient way.
I think I should use the Levenshtein distance, but I don't want to iterate through all the list.
Do you think that it is a good idea to divide the main list in some pieces with a common prefix?
Then I would have a map with prefixes as the key and with a list as the value:
In this way I could search quickly the strings with the same prefix that the searched one. Then I could apply the Levenshtein distance only for some strings and not for all the main-list.
Is it a good idea? Thanks
You could once calculate the soundex code of every entry, and map soundex to list of original word. Soundex is a reducing code to get one single key for similar sounding words.
Map<String, Set<String>> soundexToWords = ...
for (String word : words) {
String sdex = soundex(word);
Set<String> similarWords = soundexToWords.get(sdex));
if (similarWords == null) {
similarWords = new HashSet<>();
soundexToWords.put(sdex, similarWords);
}
similarWords.add(word);
}
Set<String> similarWords(String word) {
return soundexToWords.get(soundex(word));
}
Soundex is typically for one language, say English, and especially for English is quite reducive.