python algorithm string-matching words equivalence

Find equivalent words in a list through arbitrary mapping

Imagine you have a list of words:

['cat', 'ant', 'bro', 'gro']

Using some arbitrary mapping that we construct ourselves {'c'=>'a', 'a'=>'n', 't'=>'t' }, we can map 'cat' to 'ant', and similarly we can find some arbitrary mapping to convert 'bro' to 'gro'.

This is the idea of finding words that are equivalent. I wrote a function that compares two words, and checks if they are equivalent through a mapping I construct on the fly:

def compareWords(w1, w2):
   mapping = {}
   for i in xrange(0, len(w1)):
       if w1[i] in map:
           if mapping[w1[i]] == w2[i]:
               continue
           else:
               return False
        else:
           mapping[w1[i]] = w2[i]

    return True

Example input:

['cat', 'ant', 'bro', 'gro']

Example output:

[['cat','ant'], ['bro', 'gro']]

Using this function on each pair of words in the list to return an output list of lists of equivalent words runs in O(n^2) since every pair will need to be compared. I haven't implemented this part where I use this method above on the input list and generate the output list, since this method isn't the efficient comparison I am looking for. Is there a way to find equivalent words in this input list in O(n) time?

Further Explanation: If I have a list of words, and I want to find all the "equivalent" words, I would need to make the checks in pairs of words. If all letters of the words I am comparing are unique, then another word in the list is only equivalent to this first word if all the letters in the second word are also unique. so abc can be mapped to xyz if xyz is in the list. xyz can be mapped to pqr if xyz is in the list. Given this, abc, xyz, and pqr are all equivalent. This is the kind of grouping I am looking for.

Solution

If I understood correctly, you're looking for a way to check if an arbitrary relation, given as a list of pairs x,y is a function, that is, x1==x2 implies y1==y2. This can be done easily with sets:

def is_function(rel):
    return len(set(rel)) == len(set(x for x, y in rel))


print is_function(['ab', 'cd', 'xd']) # yes
print is_function(['ab', 'cd', 'ad']) # no

Two words are "equivalent" in the sense of your question if their letter-to-letter relation is a function:

def equivalent(a, b):
    return is_function(zip(a, b))

print equivalent('foo', 'baa') # yes
print equivalent('foo', 'bar') # no

If you consider equivalences between different words as different relations, there's no way you can avoid comparing each one with each. Moreover, your "equivalence" isn't even commutative, A ~ B doesn't imply B ~ A (e.g abc ~ xyx, but xyx !~ abc).

From your comment, your relation turns out to be bijective (note: your code is not correct for this case). The simplest (not necessarily the most efficient) way to split the list into "equivalence classes" would be to calculate a "hash" for each word, replacing letters with numbers, where 0 stands for the first letter, 1 for the second etc:

def eq_hash(word):
    return tuple(word.index(c) for c in word)

print eq_hash('mom') # 0 1 0
print eq_hash('dad') # 0 1 0

Now, you can group together all words that have the same "hash". These will be equivalent in the context of your question:

group = {}

words = ['mom', 'dad', 'aaa', 'bob', 'ccc', 'xyz', 'abc']

for w in words:
    h = eq_hash(w)
    group[h] = group.get(h, []) + [w]

print group.values()
# [['xyz', 'abc'], ['mom', 'dad', 'bob'], ['aaa', 'ccc']]