I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D.
I do not need a unique solution, and it doesn't have to be exactly optimal, but it should be reasonably quick (in Python) and not discard way too many entries.
How can I achieve this? Thanks.
Edit: to be clear, I can google for a python routine that measures edit distance. The problem is how to do this efficiently, and, perhaps, in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all words and then pruning?
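For reference, the naive quadratic approach I want to improve on looks roughly like this (edit_distance here is just a standard dynamic-programming Levenshtein routine, and naive_filter is a name I made up):

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def naive_filter(words, D):
    # O(n^2): compare each word against everything kept so far.
    kept = []
    for w in words:
        if all(edit_distance(w, k) > D for k in kept):
            kept.append(w)
    return kept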
You can use a BK-tree, and before each item is added, check that it is not within distance D of any item already in the tree (thanks to @DietrichEpp in the comments for this idea).
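If you haven't seen the structure before, the idea in a nutshell: every node holds a word, each child edge is labeled with the edit distance between the child's word and the parent's word, and the triangle inequality lets a search skip whole subtrees. Here is a minimal illustrative sketch (not the recipe mentioned below; SimpleBKTree is just a name I made up):

class SimpleBKTree:
    def __init__(self, distance_fn):
        self.distance = distance_fn
        self.root = None  # (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]        # follow the edge with the same label
            else:
                node[1][d] = (word, {})  # hang a new leaf off this node
                return

    def find(self, word, threshold):
        # By the triangle inequality, only children whose edge label lies in
        # [d - threshold, d + threshold] can contain a match.
        found = []
        stack = [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= threshold:
                found.append(node[0])
            stack.extend(child for label, child in node[1].items()
                         if d - threshold <= label <= d + threshold)
        return found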
You can use this recipe for a BK-tree (though any similar recipe is easily modified). Simply make two changes: change the line:
def __init__(self, items, distance, usegc=False):
to
def __init__(self, items, distance, threshold=0, usegc=False):
And change the line:
if el not in self.nodes: # do not add duplicates
to
if (el not in self.nodes and
        (threshold is None or len(self.find(el, threshold)) == 0)):
This makes sure that no word within distance threshold of an existing entry is added. Then, the code to remove near-duplicates from a list is simply:
from Levenshtein import distance
from bktree import BKtree

def remove_duplicates(lst, threshold):
    # Inserting through the modified BKtree drops anything within
    # distance threshold of an already-inserted word.
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()
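For example (the surviving words are deterministic here, but the order of tr.nodes.keys() may vary, since nodes is a dict):

words = ["apple", "apples", "applet", "banana", "bananas", "cherry"]
print(remove_duplicates(words, 2))
# -> something like ['apple', 'banana', 'cherry']: "apples", "applet",
#    and "bananas" are all within edit distance 2 of an earlier word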
Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided with the bk-tree recipe. python-Levenshtein has C-compiled components, but it's worth the installation.
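A quick sanity check of that distance function:

from Levenshtein import distance
print(distance("kitten", "sitting"))  # 3: two substitutions plus one insertion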
Finally, I set up a performance test with an increasing number of words (grabbed randomly from /usr/share/dict/words) and graphed the amount of time each run took:
import random
import time

from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    word_list = [l[:-1] for l in inf]  # strip the trailing newline

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return tr.nodes.keys()

def time_remove_duplicates(n, threshold):
    """Deduplicate n random words; return (survivor count, seconds)."""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt
plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")
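The lengths tuple computed above goes unused in the plot; if you're curious how aggressively threshold 3 prunes as n grows, the same data can be plotted with a few more lines:

plt.figure()
plt.plot(ns, lengths)
plt.xlabel("Number of strings")
plt.ylabel("Strings remaining after removal")
plt.savefig("number_vs_remaining.pdf")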
Without confirming it mathematically, I don't think the growth is quadratic; it looks more like n log n, which would make sense if inserting into a BK-tree is a log-time operation. Most notably, it runs quite quickly with under 5000 strings, which hopefully matches the OP's scale (and it's still reasonable at 15000, which a naive all-pairs loop would not be).
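As for the OP's side question about a "natural" value of D: one rough heuristic is to sweep a few thresholds on a sample and look for a knee in the number of survivors, then pick the largest D that still keeps enough words. A sketch using the functions above:

sample = random.sample(word_list, 2000)
for d in range(1, 6):
    print(d, len(remove_duplicates(sample, d)))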