Search code examples
pythonc++cperformanceanagram

most efficient way to find all the anagrams of each word in a list


I have been trying to create a program that can find all the anagrams(in the list) for each word in the text file (which contain about ~370k words seperated by '\n').

I've already written the code in python. And it took me about an hour to run. And was just wondering if there is a more efficient way of doing it.

My code

from tqdm.auto import tqdm

ls = open("words.txt","r").readlines()
ls = [i[:-1] for i in ls]
ls = [[i,''.join(sorted(i))] for i in ls]
ln = set([len(i[1]) for i in tqdm(ls)])

df = {}
for l in tqdm(ln):
    df[l] = [i for i in ls if len(i[0]) == l]

full = {}
for m in  tqdm(ls):
    if full.get(m[0]) == None:
        temp = []
        for i in df[len(m[0])]:
            if i[1] == m[1] and i[0] != m[0]:
                temp.append(i[0])
        for i in temp:
            full[i] = temp

if there are more efficient ways of writing this in other languages (Rust, C, C++, Java ...) It would be really helpful if you can also post that :)


Solution

  • Using the word sorted alphabetically by character as a search key is the direction to go. And maybe you are already doing this (I hardly ever use python) with this line in your code :

    [[i,''.join(sorted(i))] for i in ls]
    

    Anyway this is my c++ take on your problem. Live demo here : https://onlinegdb.com/_gauHBd_3

    #include <algorithm>        // for sorting
    #include <string>
    #include <unordered_map>    // for storing words/anagrams
    #include <iostream>
    #include <fstream>
    #include <set>
    
    // create a class that will hold all words
    class dictionary_t
    {
    public:
        // load a text file with one word per line
        void load(const std::string& filename)
        {
            std::ifstream file{ filename };
            std::string word;
    
            while (file >> word)
            {
                add_anagram(word);
            }
        }
    
        auto& find_anagrams(const std::string& word)
        {
            const auto key = get_key(word);
    
            // intentionally allow an empty entry to be made if word has no anagrams yet
            // for readability easier error handling (not for space/time efficiency)
            auto& anagrams = m_anagrams[key];
    
            return anagrams;
        }
    
        // show all anagrams for a word
        void show_anagrams(const std::string& word)
        {
            std::cout << "anagrams for word '" << word << "' are : ";
            auto anagrams = find_anagrams(word);
    
            for (const auto& anagram : anagrams)
            {
                if (anagram != word)
                {
                    std::cout << anagram << " ";
                }
            }
    
            std::cout << "\n";
        }
    
    private:
        // this function is key to the whole idea
        // two words are anagrams if they sort their letters
        // to the same order. e.g. beast and betas both sort (alphabetically) to abest 
        std::string get_key(const std::string& word)
        {
            std::string key{ word };
            // all anagrams sort to the same order of characters.
            std::sort(key.begin(), key.end()); 
            return key;
        }
    
        void add_anagram(const std::string& word)
        {
            // find the vector of anagrams for this word
            auto& anagrams = find_anagrams(word);
    
            // then add word to it (I use a set so all words will be unique even
            // if input file contains duplicates)
            anagrams.insert(word);
        }
    
        std::unordered_map<std::string, std::set<std::string>> m_anagrams;
    };
    
    
    int main()
    {
        dictionary_t dictionary;
        dictionary.load("words.txt");
    
        dictionary.show_anagrams("beast");
        dictionary.show_anagrams("tacos");
    
        return 0;
    }