Tags: python, optimization, python-2.x, levenshtein-distance, word-diff

How can I optimize this Python code to generate all words with word-distance 1?


Profiling shows this is the slowest segment of my code for a little word game I wrote:

def distance(word1, word2):
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

def getchildren(word, wordlist):
    return [ w for w in wordlist if distance(word, w) == 1 ]

Notes:

  • distance() is called over 5 million times, mostly from getchildren, which is supposed to return every word in wordlist that differs from word by exactly 1 letter.
  • wordlist is pre-filtered to contain only words with the same number of letters as word, so word1 and word2 are guaranteed to have the same number of chars.
  • I'm fairly new to Python (started learning it 3 days ago), so comments on naming conventions or other style things are also appreciated.
  • For wordlist, I take the 12dict word list, using the "2+2lemma.txt" file.

Results:

Thanks everyone. By combining different suggestions I got the program running twice as fast (on top of the optimizations I had already made on my own before asking, so roughly a 4 times speed increase over my initial implementation).

I tested with 2 sets of inputs, which I'll call A and B.

Optimization1: iterate over letter pairs instead of indices. I changed the loop in distance() from

    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

to one that iterates over letter pairs using zip(word1, word2):

    for x,y in zip(word1, word2):
        if x != y:
            difference += 1
    return difference

Got execution time from 11.92 to 9.18 for input A, and 79.30 to 74.59 for input B
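
For anyone who wants to reproduce a comparison like this in isolation, here is a minimal sketch using timeit. The names distance_index, distance_zip, and compare are made up for illustration, and the absolute numbers will differ by machine and Python version (in Python 3, zip is already lazy, so the gap may look different).

```python
import timeit

def distance_index(word1, word2):
    # Original version: index into both words.
    difference = 0
    for i in range(len(word1)):
        if word1[i] != word2[i]:
            difference += 1
    return difference

def distance_zip(word1, word2):
    # Optimization1: iterate over letter pairs.
    difference = 0
    for x, y in zip(word1, word2):
        if x != y:
            difference += 1
    return difference

def compare(word1="poets", word2="verse", number=100000):
    # Returns (seconds for the index version, seconds for the zip version).
    t_index = timeit.timeit(lambda: distance_index(word1, word2), number=number)
    t_zip = timeit.timeit(lambda: distance_zip(word1, word2), number=number)
    return t_index, t_zip
```

Calling compare() returns the two timings as a tuple, so they can be printed or logged side by side.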

Optimization2: Added a separate method for differs-by-one in addition to the distance-method (which I still needed elsewhere for the A* heuristics)

def is_neighbors(word1,word2):
    different = False
    for c1,c2 in zip(word1,word2):
        if c1 != c2:
            if different:
                return False
            different = True
    return different

Got execution time from 9.18 to 8.83 for input A, and 74.59 to 70.14 for input B

Optimization3: Big winner here was to use izip instead of zip

Got execution time from 8.83 to 5.02 for input A, and 70.14 to 41.69 for input B
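
A likely explanation for the jump: in Python 2, zip builds the whole list of pairs before the loop starts, while itertools.izip yields pairs one at a time, so the early return in is_neighbors never pays for pairs it does not look at. A minimal sketch follows; the try/except fallback is my own addition so that the same code also runs on Python 3, where izip no longer exists.

```python
try:
    from itertools import izip  # Python 2: lazy pairwise iterator
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

def is_neighbors(word1, word2):
    # True only if the words differ in exactly one position.
    different = False
    for c1, c2 in izip(word1, word2):
        if c1 != c2:
            if different:
                return False  # second mismatch: stop early
            different = True
    return different
```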

I could probably do better writing it in a lower level language, but I'm happy with this for now. Thanks everyone!

Edit again: more results. Using Mark's method of first checking the case where the first letter doesn't match got it down from 5.02 -> 3.59 and 41.69 -> 29.82.

Building on that and incorporating izip instead of range, I ended up with this:

def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

Which squeezed a little bit more, bringing the times down from 3.59 -> 3.38 and 29.82 -> 27.88
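
Since each of these rewrites changes the comparison logic, a cheap sanity check is to confirm that is_neighbors still agrees with distance(...) == 1 on every pair of words over a small alphabet. This is only a sketch: check_consistency is a hypothetical helper, and plain zip is used instead of izip so that it runs on both Python 2 and 3.

```python
from itertools import product

def distance(word1, word2):
    # Number of positions at which the two words differ.
    return sum(1 for x, y in zip(word1, word2) if x != y)

def is_neighbors(word1, word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x, y in zip(word1[1:], word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different

def check_consistency(alphabet="abc", length=3):
    # Exhaustively compare both functions on all words of the given length.
    words = ["".join(p) for p in product(alphabet, repeat=length)]
    for w1 in words:
        for w2 in words:
            assert is_neighbors(w1, w2) == (distance(w1, w2) == 1), (w1, w2)
    return len(words)
```

check_consistency() raises an AssertionError naming the offending pair if the two functions ever disagree, and otherwise returns the number of words it tried.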

Even more results!

Trying Sumudu's suggestion to generate a list of all strings that are 1 letter off from "word" and then check which of them are in the wordlist, instead of using the is_neighbors function, I ended up with this:

def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return ( w for w in oneoff if w in wordlist )

Which ended up being slower (3.38 -> 3.74 and 27.88 -> 34.40) but it seemed promising. At first I thought the part I'd need to optimize was "one_letter_off_strings" but profiling showed otherwise and that the slow part was in fact

( w for w in oneoff if w in wordlist )

I was wondering whether it would make any difference if I switched "oneoff" and "wordlist" and did the comparison the other way around, when it hit me that I was looking for the intersection of the 2 lists. I replaced that line with a set intersection of the two word lists:

return set(oneoff) & set(wordlist)

Bam! 3.74 -> 0.23 and 34.40 -> 2.25
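
One thing worth noting about that line: set(wordlist) is rebuilt on every call to getchildren. A possible further refinement, assuming the word list does not change during the search, is to build the set once and pass it in. This is only a sketch; the name word_set and the toy word list are made up for illustration, and I have not timed this variant against the inputs above.

```python
import string

def one_letter_off_strings(word):
    # Every string that differs from word in exactly one position.
    return [word[:i] + l + word[i+1:]
            for i in range(len(word))
            for l in string.ascii_lowercase
            if l != word[i]]

def getchildren(word, word_set):
    # word_set is assumed to be a set or frozenset built once by the caller.
    return word_set.intersection(one_letter_off_strings(word))

# Toy stand-in for the real word list.
word_set = frozenset(["poets", "posts", "pouts", "ports", "verse"])
children = getchildren("poets", word_set)
```

Membership tests on a set are hash lookups, which is why this beats scanning the list once per candidate with "w in wordlist".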

This is truly amazing. Total speed difference from my original naive implementation: 23.79 -> 0.23 and 180.07 -> 2.25, so approx 80 to 100 times faster than the original implementation.

If anyone is interested, I made a blog post describing the program and the optimizations made, including one that isn't mentioned here (because it's in a different section of code).

The Great Debate:

OK, Unknown and I are having a big debate, which you can read in the comments on his answer. He claims that the original method (using is_neighbors instead of the sets) would be faster if it were ported to C. I tried for 2 hours to get a C module I wrote to build and link, without much success, after trying to follow this and this example; it looks like the process is a little different on Windows? I don't know, but I gave up on that. Anyway, here's the full code of the program; the text file comes from the 12dict word list, using the "2+2lemma.txt" file. Sorry if the code's a little messy, this was just something I hacked together. Also, I forgot to strip out commas from the wordlist, so that's actually a bug that you can either leave in for the sake of a like-for-like comparison or fix by adding a comma to the list of chars in cleanentries.

from itertools import izip
def unique(seq):  
    seen = {} 
    result = [] 
    for item in seq: 
        if item in seen:
            continue 
        seen[item] = 1 
        result.append(item) 
    return result
def cleanentries(li):
    # Drop the "->" tokens and strip brackets; commas are left in (known bug, see above).
    return unique( [w.strip('[]') for w in li if w != "->"] )
def distance(word1, word2):
    difference = 0
    for x,y in izip (word1, word2):
        if x != y:
            difference += 1
    return difference
def is_neighbors(word1,word2):
    if word1[0] != word2[0]:
        return word1[1:] == word2[1:]
    different = False
    for x,y in izip(word1[1:],word2[1:]):
        if x != y:
            if different:
                return False
            different = True
    return different
def one_letter_off_strings(word):
    import string
    dif_list = []
    for i in xrange(len(word)):
        dif_list.extend((word[:i] + l + word[i+1:] for l in string.ascii_lowercase if l != word[i]))
    return dif_list

def getchildren(word, wordlist):
    oneoff = one_letter_off_strings(word)
    return set(oneoff) & set(wordlist)
def AStar(start, goal, wordlist):
    import Queue
    closedset = []
    openset = [start]
    pqueue = Queue.PriorityQueue(0)
    g_score = {start:0}         #Distance from start along optimal path.
    h_score = {start:distance(start, goal)}
    f_score = {start:h_score[start]}
    pqueue.put((f_score[start], start))
    parent_dict = {}
    while len(openset) > 0:
        x = pqueue.get(False)[1]
        if x == goal:
            return reconstruct_path(parent_dict,goal)
        openset.remove(x)
        closedset.append(x)
        sortedOpen = [(f_score[w], w, g_score[w], h_score[w]) for w in openset]
        sortedOpen.sort()
        for y in getchildren(x, wordlist):
            if y in closedset:
                continue
            temp_g_score = g_score[x] + 1
            temp_is_better = False
            appended = False
            if (not y in openset): 
                openset.append(y)
                appended = True
                h_score[y] = distance(y, goal)
                temp_is_better = True
            elif temp_g_score < g_score[y] :
                temp_is_better = True
            else :
                pass
            if temp_is_better:
                parent_dict[y] = x
                g_score[y] = temp_g_score
                f_score[y] = g_score[y] + h_score[y]
                if appended :
                    pqueue.put((f_score[y], y))
    return None


def reconstruct_path(parent_dict,node):
    if node in parent_dict:
        p = reconstruct_path(parent_dict,parent_dict[node])
        p.append(node)
        return p
    else:
        return []

wordfile = open("2+2lemma.txt")
wordlist = cleanentries(wordfile.read().split())
wordfile.close()
words = []
while True:
    userentry = raw_input("Hello, enter the 2 words to play with separated by a space:\n ")
    words = [w.lower() for w in userentry.split()]
    if(len(words) == 2 and len(words[0]) == len(words[1])):
        break
print "You selected %s and %s as your words" % (words[0], words[1])
wordlist = [ w for w in wordlist if len(words[0]) == len(w)]
answer = AStar(words[0], words[1], wordlist)
if answer != None:
    print "Minimum number of steps is %s" % (len(answer))
    reply = raw_input("Would you like the answer(y/n)? ")
    if(reply.lower() == "y"):
        answer.insert(0, words[0])
        print "\n".join(answer)
    else:
        print "Good luck!"
else:
    print "Sorry, there's no answer to yours"
reply = raw_input("Press enter to exit")

I left the is_neighbors method in even though it's not used. This is the method that is proposed to be ported to C. To use it, just replace getchildren with this:

def getchildren(word, wordlist):
    return ( w for w in wordlist if is_neighbors(word, w))

As for getting it to work as a C module I didn't get that far, but this is what I came up with:

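#include "Python.h"
#include <string.h>

/* Returns 1 if word1 and word2 differ in exactly one position, else 0. */
static PyObject *
py_is_neighbor(PyObject *self, PyObject *args)
{
    /* All declarations come first so that C89 compilers accept the code. */
    const char *word1, *word2;
    size_t length, i;
    int different = 0;

    /* "ss" parses exactly two C strings; the length is computed below. */
    if (!PyArg_ParseTuple(args, "ss", &word1, &word2))
        return NULL;

    /* Words of unequal length are never neighbors. */
    length = strlen(word1);
    if (strlen(word2) != length)
        return Py_BuildValue("i", 0);

    for (i = 0; i < length; i++)
    {
        if (word1[i] != word2[i])
        {
            if (different)
            {
                /* Second mismatch: not neighbors. */
                return Py_BuildValue("i", 0);
            }
            different = 1;
        }
    }
    return Py_BuildValue("i", different);
}

static PyMethodDef methods[] = {
    {"isneighbor", py_is_neighbor, METH_VARARGS, "Returns whether words are neighbors"},
    {NULL, NULL, 0, NULL}
};

/* Python 2 looks for init<modulename>, so this must match "isneighbor". */
PyMODINIT_FUNC
initisneighbor(void)
{
    Py_InitModule("isneighbor", methods);
}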

I profiled this using:

python -m cProfile "Wordgame.py"

And the time recorded was the total time of the AStar method call. The fast input set was "verse poets" and the long input set was "poets verse". Timings will obviously vary between different machines, so if anyone does end up trying this, please give a comparison of the results for the program as is and with the C module.


Solution

  • If your wordlist is very long, might it be more efficient to generate all possible 1-letter-differences from 'word', then check which ones are in the list? I don't know any Python but there should be a suitable data structure for the wordlist allowing for log-time lookups.

    I suggest this because if your words are reasonable lengths (~10 letters), then you'll only be looking up about 250 candidate words (25 other letters at each of 10 positions), which is probably faster if your wordlist is larger than a few hundred words.
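
A rough Python sketch of this idea, for readers who want to see the arithmetic. The helper name one_letter_variants and the toy lookup set are made up for illustration. A Python set is assumed as the lookup structure; it gives average constant-time membership tests, which is at least as good as the log-time lookups asked for.

```python
import string

def one_letter_variants(word, alphabet=string.ascii_lowercase):
    # Yield every string that differs from word in exactly one position.
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                yield word[:i] + letter + word[i + 1:]

variants = list(one_letter_variants("abcdefghij"))  # a 10-letter word
count = len(variants)  # 10 positions * 25 other letters each

# Toy stand-in for the real word list.
lookup = set(["abcdefghiz", "zzzzzzzzzz"])
matches = [v for v in variants if v in lookup]
```

The number of candidates depends only on the word length and alphabet size, not on how many words the list contains.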