Search code examples
javaperformancesimilaritylevenshtein-distance

Java: Find similar strings


I have a java list (it could be a map if it is necessary) with a lot of strings.

  • List(hello,hell,car,cartoon,...)

I want to find the most similar strings for another given string in an efficient way.

I think I should use the Levenshtein distance, but I don't want to iterate through all the list.

Do you think that it is a good idea to divide the main list in some pieces with a common prefix?

Then I would have a map with prefixes as the key and with a list as the value:

  • hel -> List(hello,hell,...)
  • car -> List(car,cartoon,...)

In this way I could search quickly the strings with the same prefix that the searched one. Then I could apply the Levenshtein distance only for some strings and not for all the main-list.

Is it a good idea? Thanks


Solution

  • You could once calculate the soundex code of every entry, and map soundex to list of original word. Soundex is a reducing code to get one single key for similar sounding words.

    Map<String, Set<String>> soundexToWords = ...
    for (String word : words) {
        String sdex = soundex(word);
        Set<String> similarWords = soundexToWords.get(sdex));
        if (similarWords == null) {
            similarWords = new HashSet<>();
            soundexToWords.put(sdex, similarWords);
        }
        similarWords.add(word);
    }
    
    Set<String> similarWords(String word) {
        return soundexToWords.get(soundex(word));
    }
    

    Soundex is typically for one language, say English, and especially for English is quite reducive.