Search code examples
pythonkeycombinationspython-itertools

Excluding combinations with duplicates based on a key function


I want to iterate over subsequences of a list, but I have a notion of uniqueness defined by an external function, and I want to ignore combinations that have multiple elements with the same value under the function.

For example, I have a list of names, and I want to iterate over all combinations of three of those names such that all three names start with different letters. The following code accomplishes this:

import itertools

names = ["Anabel",
         "Alison",
         "Avery",
         "Abigail",
         "Aimee",
         "Alice",
         "Bethany",
         "Beatrice",
         "Claudia",
         "Carolyn",
         "Diane",
         "Dana"]

f = lambda x : x[0]

for i in itertools.combinations(names, 3):
    if ((f(i[0]) != f(i[1])) and
        (f(i[0]) != f(i[2])) and
        (f(i[1]) != f(i[2]))):
        print i

What I'm actually doing here is iterating over all possible combinations of 3 names, and throwing out the ones that don't, which is of course slower than iterating over all combinations of 3 names. Is there a way that would actually be faster? To create an iterator that excludes the ones I want to skip in the first place?


Solution

  • Yes it's possible, one solution I could think of requires to create a dictionary grouping the f(value) and the actual names as a dictionary. I'll use iteration_utilities.groupedby here but it's easy to do it yourself using collections.defaultdict as well (I'll show it at the bottom of the answer).

    >>> from iteration_utilities import groupedby 
    >>> equivalent = groupedby(names, f)
    >>> equivalent
    {'A': ['Anabel', 'Alison', 'Avery', 'Abigail', 'Aimee', 'Alice'],
     'B': ['Bethany', 'Beatrice'],
     'C': ['Claudia', 'Carolyn'],
     'D': ['Diane', 'Dana']}
    

    Then you iterate over the combinations of the (sorted) keys in that dictionary and then use itertools.product to do the iterations over all names for each prefix:

    import itertools
    
    for comb in itertools.combinations(sorted(equivalent), 3):
        for uniquecomb in itertools.product(*[equivalent[i] for i in comb]):
            print(uniquecomb)
    

    The sorted is used because otherwise the order of appearance wouldn't be deterministic between runs.


    To show that this is much faster I used the following setup:

    def unique_combs(names, f):
        equivalent = groupedby(names, f)
    
        for comb in itertools.combinations(sorted(equivalent), 3):
            for uniquecomb in itertools.product(*[equivalent[i] for i in comb]):
                pass
    
    def unique_combs_original(names, f):
        for i in itertools.combinations(names, 3):
            if ((f(i[0]) != f(i[1])) and
                    (f(i[0]) != f(i[2])) and
                    (f(i[1]) != f(i[2]))):
                pass
    
    names = ["Anabel", "Alison", "Avery", "Abigail", "Aimee", "Alice",
             "Bethany", "Beatrice",
             "Claudia", "Carolyn",
             "Diane", "Dana"]
    
    f = lambda x : x[0]
    
    %timeit unique_combs(names, f)           # 10000 loops, best of 3: 59.4 µs per loop
    %timeit unique_combs_original(names, f)  # 1000 loops, best of 3: 417 µs per loop
    

    But it also scales much better if there are lots of to-be-discarded combinations:

    names = names * 10  # more duplicates
    
    %timeit unique_combs(names, f)           # 100 loops, best of 3: 9.74 ms per loop
    %timeit unique_combs_original(names, f)  # 1 loop, best of 3: 577 ms per loop
    

    I mentioned defaultdict instead of groupedby, for completness it can be created like this:

    from collections import defaultdict
    
    >>> names = ["Anabel", "Alison", "Avery", "Abigail", "Aimee", "Alice",
    ...          "Bethany", "Beatrice",
    ...         "Claudia", "Carolyn",
    ...         "Diane", "Dana"]
    
    >>> equivalent = defaultdict(list)
    >>> for name in names:
    ...     equivalent[f(name)].append(name)
    
    >>> equivalent
    defaultdict(list,
                {'A': ['Anabel', 'Alison', 'Avery', 'Abigail', 'Aimee', 'Alice'],
                 'B': ['Bethany', 'Beatrice'],
                 'C': ['Claudia', 'Carolyn'],
                 'D': ['Diane', 'Dana']})