Search code examples
pythonrsetpermutationpython-itertools

Set of HUGE permutation object (in Python or R)


Aim: I'd like to get (or be able to work with) a set of all possible permutations obtained from a list of strings.

Example in Python:

import pandas as pd
import itertools

list1 = ['A', 'A', 'B', 'B']

# Get all permutations
list1_perm = list(itertools.permutations(list1))

len(list1_perm)
24

list1_perm
[('A', 'A', 'B', 'B'),
 ('A', 'A', 'B', 'B'),
 ('A', 'B', 'A', 'B'),
 ('A', 'B', 'B', 'A'),
 ('A', 'B', 'A', 'B'),
 ('A', 'B', 'B', 'A'),
 ('A', 'A', 'B', 'B'),
 ('A', 'A', 'B', 'B'),
 ('A', 'B', 'A', 'B'),
 ('A', 'B', 'B', 'A'),
 ('A', 'B', 'A', 'B'),
 ('A', 'B', 'B', 'A'),
 ('B', 'A', 'A', 'B'),
 ('B', 'A', 'B', 'A'),
 ('B', 'A', 'A', 'B'),
 ('B', 'A', 'B', 'A'),
 ('B', 'B', 'A', 'A'),
 ('B', 'B', 'A', 'A'),
 ('B', 'A', 'A', 'B'),
 ('B', 'A', 'B', 'A'),
 ('B', 'A', 'A', 'B'),
 ('B', 'A', 'B', 'A'),
 ('B', 'B', 'A', 'A'),
 ('B', 'B', 'A', 'A')]

Since for my analysis ('A', 'A', 'B', 'B') is the same as ('A', 'A', 'B', 'B'), (although the 'A' may have changed the position), I do:

# Get set of permutations
set1_perm = set(itertools.permutations(list1))

len(set1_perm)
6

set1_perm
{('A', 'A', 'B', 'B'),
 ('A', 'B', 'A', 'B'),
 ('A', 'B', 'B', 'A'),
 ('B', 'A', 'A', 'B'),
 ('B', 'A', 'B', 'A'),
 ('B', 'B', 'A', 'A')}

Now this is great, but the list I want to work with has 481 strings, with 5 unique strings with different frequencies:

len(real_list)
481

len(set(real_list))
5

# Count number of times each unique value appears
pd.Series(real_list).value_counts()
A  141
B  116
C  80
D  78
E  66

This is not a problem for itertools.permutations(real_list), but when I want to get the set, it takes ages. This is because the number of permutations is 9.044272819E+1082.

What I want to do is: First I want to know the number of unique elements in that permutation space, i.e. the length of the set. To get the number of unique elements it might be possible to do it analytically, however since the frequency of each unique element is different I don't how to do that.

Second I would like to be able to get a sample of those unique elements in the set of permutations.

I'd appreciate any help provided.

Best, Alejandro


Solution

  • Calculating the number of unique permutations is simply a matter of applying a formula - we know that were we to have n distinct elements, we would have n! permutations. Then to account for repeated permutations we must divide by each count of permutations of repeated letters. This is a multinomial coefficient.

    enter image description here

    So a simple implementation to generate the unique count may look something like

    from math import factorial
    from functools import reduce
    from collections import Counter
    
    def perm_cnt(l):
        denom = reduce(lambda x,y: x*factorial(y), Counter(l).values())
        return factorial(len(l)) // denom
    

    Then sampling from the unique permutations is probably most simply achieved by just ensuring your sample values remain unique, as opposed to trying to generate all of the unique values and then sampling. There is a recipe in the itertools module, random_permutation, which could be useful for this.

    def random_permutation(iterable, r=None):
        "Random selection from itertools.permutations(iterable, r)"
        pool = tuple(iterable)
        r = len(pool) if r is None else r
        return tuple(random.sample(pool, r))
    

    So creating a unique sample might look something like

    def uniq_sample(l, size):
        s = set()
        perm_size = perm_cnt(l)
        cnt = 0
        while cnt < min(perm_size, size):
            samp = random_permutation(l)
            if samp not in s:
                s.add(samp)
                cnt += 1
        return s
    

    Demo

    >>> perm_cnt(list1)
    6
    
    >>> perm_cnt(['a']*3 + ['b']*5 + ['d']*2)
    2520
    
    >>> perm_cnt(np.random.randint(10, size=20))
    105594705216000
    
    >>> uniq_sample(list1, 4)
    {('A', 'A', 'B', 'B'),
     ('B', 'A', 'A', 'B'),
     ('B', 'A', 'B', 'A'),
     ('B', 'B', 'A', 'A')}