Search code examples
pythonstringalgorithmcomparison

Find all pairs of strings in two lists that contain no common characters


I have two lists of strings, and wish to find all pairs of strings between them that contain no common characters. e.g.

list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']

desired output = [('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]

I need this to be as efficient as possible because I am processing lists with millions of strings. At the moment, my code follows this general pattern:

output = []

for str1 in list1:    
    for str2 in list2:
        if len(set(str1) & set(str2)) == 0: 
             output.append((str1, str2))

This is O(n^2) and is taking many hours to run, does anyone have some suggestions for how to speed this up? Perhaps there is a way to take advantage of characters in each string being sorted?

Thank you very much in advance!


Solution

  • Here's another tack, focusing on lowering the set operations to bit twiddling and combining words that represent the same set of letters:

    import collections
    import string
    
    
    def build_index(words):
        index = collections.defaultdict(list)
        for word in words:
            chi = sum(1 << string.ascii_lowercase.index(letter) for letter in set(word))
            index[chi].append(word)
        return index
    
    
    def disjoint_pairs(words1, words2):
        index1 = build_index(words1)
        index2 = build_index(words2)
        for chi1, words1 in index1.items():
            for chi2, words2 in index2.items():
                if chi1 & chi2:
                    continue
                for word1 in words1:
                    for word2 in words2:
                        yield word1, word2
    
    
    print(list(disjoint_pairs(["abc", "cde"], ["aij", "xyz", "abc"])))