Search code examples
pythondictionaryinformation-retrievaltrie

How to find possible English words in long random string?


I'm doing an artistic project where I want to see if any information emerges from a long string of characters (~28,000). It's sort of like the problem one faces in solving a Jumble. Here's a snippet:

jfifddcceaqaqbrcbdrstcaqaqbrcrisaxohvaefqiygjqotdimwczyiuzajrizbysuyuiathrevwdjxbinwajfgvlxvdpdckszkcyrlliqxsdpunnvmedjjjqrczrrmaaaipuzekpyqflmmymedvovsudctceccgexwndlgwaqregpqqfhgoesrsridfgnlhdwdbbwfmrrsmplmvhtmhdygmhgrjflfcdlolxdjzerqxubwepueywcamgtoifajiimqvychktrtsbabydqnmhcmjhddynrqkoaxeobzbltsuenewvjbstcooziubjpbldrslhmneirqlnpzdsxhyqvfxjcezoumpevmuwxeufdrrwhsmfirkwxfadceflmcmuccqerchkcwvvcbsxyxdownifaqrabyawevahiuxnvfbskivjbtylwjvzrnuxairpunskavvohwfblurcbpbrhapnoahhcqqwtqvmrxaxbpbnxgjmqiprsemraacqhhgjrwnwgcwcrghwvxmqxcqfpcdsrgfmwqvqntizmnvizeklvnngzhcoqgubqtsllvppnedpgtvyqcaicrajbmliasiayqeitcqtexcrtzacpxnbydkbnjpuofyfwuznkf

What's the most efficient way of searching for all possible English words embedded (both forwards and backwards) in this string?

What is a useful dictionary against which to check the substrings? Is there a good library for doing this sort of thing? I have searched around and found some interesting TRIE solutions; but most of them are dealing with the situation where you know the set of words in advance.


Solution

  • I used this solution to find all words forwards and backwards from a corpus of 28,000 random characters in a dictionary of 100,000 words in .5 seconds. It runs in O(n) time. It takes a file called "words.txt" which is a dictionary that has words separated by some kind of whitespace. I used the default unix wordlist in /usr/share/dict/words but I'm sure you can find plenty of text file dictionaries online if not that one.

    from random import choice
    import string
    
    dictionary = set(open('words.txt','r').read().lower().split())
    max_len = max(map(len, dictionary)) #longest word in the set of words
    
    text = ''.join([choice(string.ascii_lowercase) for i in xrange(28000)])
    text += '-'+text[::-1] #append the reverse of the text to itself
    
    words_found = set() #set of words found, starts empty
    for i in xrange(len(text)): #for each possible starting position in the corpus
        chunk = text[i:i+max_len+1] #chunk that is the size of the longest word
        for j in xrange(1,len(chunk)+1): #loop to check each possible subchunk
            word = chunk[:j] #subchunk
            if word in dictionary: #constant time hash lookup if it's in dictionary
                words_found.add(word) #add to set of words
    
    print words_found