Search code examples
javaanagram

Java Anagram running out of memory


I am trying to solve the age old anagram problem. Thanks to many tutorials out there, I am able to iterate through a Set of Strings, recursively find all permutations, then compare them against a list of English words. The problem I am finding is that after about three words (usually on something like "anamorphosis"), I get a OutOfMemory error. I tried breaking my batches into small sets because it appears to be the recursive part consuming all my memory. But even just "anamorphosis" locks it up...

Here I read the words from a file into a List

Scanner scanner = new Scanner(resource.getInputStream());
   while (scanner.hasNext()) {
       String s = scanner.nextLine();
        uniqueWords.add(s.toLowerCase());
   }

Now I break them into smaller sets and call a class to generate anagrams:

List<List<String>> subSets = Lists.partition(new ArrayList(uniqueWords), SET_SIZE);

for (List<String> set: subSets) {
      // tried created as class attribute & injection, no difference 
      AnagramGenerator anagramGenerator = new AnagramGenerator();
      List<Word> anagrams = anagramGenerator.createWordList(set);
      wordsRepository.save(anagrams);
      LOGGER.info("Inserted {} records into the database", anagrams.size());
 }

And finally my generator:

public class AnagramGenerator {

private Map<String, List<String>> map = new Hashtable<>();
public List<Word> createWordList(List<String> dictionary) {

   buildAnagrams(dictionary);

   List<Word> words = new ArrayList<>();
   for (Map.Entry<String, List<String>> entry : map.entrySet()) {
       words.add(new Word(entry.getKey(), entry.getValue()));
   }
    return words;
   }

private Map<String, List<String>> buildAnagrams(List<String> dictionary) {

        for (String str : dictionary) {
            String key = sortString(str);
            if (map.get(key) != null) {
                map.get(key).add(str.toLowerCase());
            } else {
                if (str.length() < 2) {
                    map.put(key, new ArrayList<>());
                } else {
                    Set<String> permutations = permutations(str);
                    Set<String> anagramList = new HashSet<>();

                    for (String temp : permutations) {
                        if (dictionary.contains(temp) && !temp.equalsIgnoreCase(str)) {
                            anagramList.add(temp);
                        }
                    }
                    map.put(key, new ArrayList<>(anagramList));
                }
            }
        }
        return map;
    }

   private Set<String> permutations(String str) {    
        if (str.isEmpty()) {
            return Collections.singleton(str);
        } else {
            Set<String> set = new HashSet<>();
            for (int i = 0; i < str.length(); i++)
                for (String s : permutations(str.substring(0, i) + str.substring(i + 1)))
                    set.add(str.charAt(i) + s);
            return set;
        }
    }

Edit: Based upon the excellent feedback I have changed my generator from a permutations to a work lookup:

public class AnagramGenerator {
private Map<String, Set<String>> groupedByAnagram = new HashMap<String, Set<String>>();

    private Set<String> dictionary;

    public AnagramGenerator(Set<String> dictionary) {

        this.dictionary = dictionary;
    }

 public List<Word> searchAlphabetically() {

        List<Word> words = new ArrayList<>();
        for (String word : dictionary) {
            String key = sortString(word);
            if (!groupedByAnagram.containsKey(key)) {
                groupedByAnagram.put(key, new HashSet<>());
            }
            if (!word.equalsIgnoreCase(key)) {
                groupedByAnagram.get(key).add(word);
            }
        }

        for (Map.Entry<String, Set<String>> entry : groupedByAnagram.entrySet()) {
            words.add(new Word(entry.getKey(), new ArrayList(entry.getValue())));
        }

        return words;
    }
 private String sortString(String goodString) {

        char[] letters = goodString.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

It has a bit more tweaking so I don't add a word as it's own anagram, but otherwise this appears to be blazing fast. And, the code is much cleaner. Thanks everyone!


Solution

  • As noted for longer words, the number of permutations soon becomes enormous.

    /usr/share/dict/british-english on Debian has 99,156 lines. There are longer word lists, but let's use that as an example.

    The number of permutations for a nine letter word is 9! = 362,880

    Therefore, for words of 9 letters or more, it's less computational effort to try every word in the dictionary, than it is to try every permutation of the input word.

    10! milliseconds = ~1 hour
    12! milliseconds = ~5.54 days
    15! milliseconds = ~41.44 years
    

    And you'd be lucky to process one permutation per ms, so you can see you soon get to a number of permutations that are completely impractical to work with. The impact on stack and heap mounts up at the same rate.

    So, try the algorithm (pseudocode):

     sorted_input = sort_alphabetically(input_word)
     for each dictionary_word // probably a file readline()
         sorted_dictionary_word = sort_alphabetically(dictionary_word)
         if(sorted_dictionary_word = sorted_input)
             it's an anagram! Handle it
         end 
     end
    

    Similarly, you can fairly quickly write all dictionary-word algorithms into a lookup data structure. Pseudocode again; in Java you could use a Map<String, List<String>> or a MultiMap from Apache Commons or Guava:

      multimap = new MultiMap<String, String> // or whatever
    
      def build_dict:
          for each dictionary_word // probably a file readline()
              multimap.add(
                   sort_alphabetically(dictionary_word), 
                   dictionary_word)
          end
      end
    
      def lookup_anagrams(word):
          return multimap.get(sort_alphabetically(word))
      end 
    

    This takes up a moderate amount of memory (the whole dictionary, plus a bit for the keys and map overheads), but means that once the structure is created, you can query over and over again very cheaply indeed.

    If you want to find two-word anagrams, you will need a more complicated and interesting algorithm. But even then, avoiding brute-forcing the entire search-space of permutations is vital to your success.