Tags: c++, python-3.x, unique, combinations, lookup

FAST unique combinations (from list with duplicates) WITHOUT LOOKUPS


Although there are plenty of algorithms and functions available online for generating unique combinations of any size from a list of unique items, none is available for a list of non-unique items (i.e. a list containing repetitions of the same value).

The question is: how can a generator function produce all the unique combinations from a non-unique list ON-THE-FLY, without the computationally expensive need to filter out duplicates?

I consider a combination comboA to be unique if there is no other combination comboB for which the sorted lists of both combinations are the same. Here is example code checking for such uniqueness:

comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")

In the example above, B is a duplicate of A and print() outputs B is a duplicate of A.

The problem of getting a generator function capable of providing unique combinations on-the-fly from a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the provided generator function needs lookups and requires memory, which causes problems when the number of combinations is huge.

The function provided in the current version of the answer does the job without any lookups and appears to be the right answer here, BUT ...

The goal behind getting rid of lookups is to speed up the generation of unique combinations in case of a list with duplicates.

When writing the first version of this question, I wrongly assumed that code which doesn't need to create a set for uniqueness lookups would have an advantage over code that does. That is not the case, at least not always. The code in the answer provided so far does not use lookups, but takes much more time to generate all the combinations when the list contains no redundant items, or only a few.
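For reference, a lookup-based generator of the kind discussed above might look like the following. This is my own illustrative sketch (the name `uniqueCombinations_lookup` and the exact filtering strategy are assumptions, not the code from the linked answer); it shows why such an approach needs memory proportional to the number of unique combinations:

```python
from itertools import combinations

def uniqueCombinations_lookup(ls, k):
    """Filter duplicates by remembering every combination already
    yielded; the `seen` set is the lookup structure whose memory
    grows with the number of unique combinations."""
    seen = set()
    for combo in combinations(sorted(ls), k):  # sorting makes duplicates identical tuples
        if combo not in seen:                  # one lookup per candidate combination
            seen.add(combo)
            yield combo
```

For example, `list(uniqueCombinations_lookup([1, 2, 2, 3], 2))` yields the four unique pairs `(1, 2)`, `(1, 3)`, `(2, 2)`, `(2, 3)` instead of the six (partly duplicated) pairs that `combinations()` alone would produce.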

Here are some timings to illustrate the current situation:

-----------------
 k: 6 len(ls): 48
Combos   Used Code                               Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.036 seconds
12271512 len(list(subbags(ls,k)))            : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) :  8.174 seconds
12271512 len(set(combinations(sorted(ls),k))):  7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.030 seconds
       1 len(list(subbags(ls,k)))            :  0.001 seconds
       1 len(list(uniqueCombinations(ls,k))) :  3.619 seconds
       1 len(set(combinations(sorted(ls),k))):  2.592 seconds

The timings above illustrate the two extremes: no duplicates and only duplicates. All other timings fall between these two.

My interpretation of the results above is that a pure Python function (not using any C-compiled modules) can be extremely fast, but it can also be much slower, depending on how many duplicates are in the list. So there is probably no way around writing C/C++ code for a Python .so extension module providing the required functionality.


Solution

  • Instead of post-processing/filtering your output, you can pre-process your input list. This way, you can avoid generating duplicates in the first place. Pre-processing involves either sorting the input (or using a collections.Counter on it). One possible recursive implementation is:

    def subbags(bag, k):
        a = sorted(bag)       # sorting groups equal items together
        n = len(a)
        sub = []              # the combination currently being built
    
        def index_of_next_unique_item(i):
            # Index of the first item after a[i] that differs from a[i];
            # used to skip an item together with all its duplicates.
            j = i + 1
    
            while j < n and a[j] == a[i]:
                j += 1
    
            return j
    
        def combinate(i):
            if len(sub) == k:
                yield tuple(sub)
            elif n - i >= k - len(sub):   # enough items left to reach size k
                # Branch 1: include a[i] in the combination.
                sub.append(a[i])
                yield from combinate(i + 1)
                sub.pop()
                # Branch 2: exclude a[i] and all its duplicates.
                yield from combinate(index_of_next_unique_item(i))
    
        yield from combinate(0)
    
    bag = [1, 2, 3, 1, 2, 1]
    k = 3
    i = -1
    
    print(sorted(bag), k)
    print('---')
    
    for i, subbag in enumerate(subbags(bag, k)):
        print(subbag)
    
    print('---')
    print(i + 1)
    

    Output:

    [1, 1, 1, 2, 2, 3] 3
    ---
    (1, 1, 1)
    (1, 1, 2)
    (1, 1, 3)
    (1, 2, 2)
    (1, 2, 3)
    (2, 2, 3)
    ---
    6
    

    This requires some stack space for the recursion, but it plus sorting the input should use substantially less time and memory than generating and discarding repeats.
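    The answer mentions collections.Counter as the other pre-processing option but only shows the sorting-based code. A Counter-based variant could look like the following sketch (the name `subbags_counter` and the recursion structure are my own illustration, not code from the answer): it recurses over the distinct values and decides how many copies of each to take, so duplicates are never generated in the first place.

    ```python
    from collections import Counter

    def subbags_counter(bag, k):
        """Yield the unique k-combinations of a multiset by choosing,
        for each distinct value, how many copies to include."""
        items = sorted(Counter(bag).items())   # [(value, multiplicity), ...]

        def rec(idx, remaining):
            if remaining == 0:
                yield ()                        # combination complete
                return
            if idx == len(items):
                return                          # ran out of distinct values
            value, mult = items[idx]
            # Take t copies of this value, t = 0 .. min(mult, remaining).
            for t in range(min(mult, remaining) + 1):
                for tail in rec(idx + 1, remaining - t):
                    yield (value,) * t + tail

        yield from rec(0, k)
    ```

    For bag = [1, 2, 3, 1, 2, 1] and k = 3 this produces the same six unique combinations as subbags() above, though in a different order.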