java performance similarity levenshtein-distance

Java: Find similar strings

I have a java list (it could be a map if it is necessary) with a lot of strings.

List(hello,hell,car,cartoon,...)

I want to find the most similar strings for another given string in an efficient way.

I think I should use the Levenshtein distance, but I don't want to iterate through all the list.

Do you think that it is a good idea to divide the main list in some pieces with a common prefix?

Then I would have a map with prefixes as the key and with a list as the value:

hel -> List(hello,hell,...)
car -> List(car,cartoon,...)

In this way I could search quickly the strings with the same prefix that the searched one. Then I could apply the Levenshtein distance only for some strings and not for all the main-list.

Is it a good idea? Thanks

Solution

You could once calculate the soundex code of every entry, and map soundex to list of original word. Soundex is a reducing code to get one single key for similar sounding words.

Map<String, Set<String>> soundexToWords = ...
for (String word : words) {
    String sdex = soundex(word);
    Set<String> similarWords = soundexToWords.get(sdex));
    if (similarWords == null) {
        similarWords = new HashSet<>();
        soundexToWords.put(sdex, similarWords);
    }
    similarWords.add(word);
}

Set<String> similarWords(String word) {
    return soundexToWords.get(soundex(word));
}

Soundex is typically for one language, say English, and especially for English is quite reducive.