Tags: python, random, permutation, combinatorics, sampling

Random sampling of n lists of m elements in python


I wrote this code, which creates all combinations of n lists of m elements in Python, samples a given number of unique combinations (the maximum possible or 1000, whichever is smaller), and writes them to Excel. It basically works, but when product(m_i) becomes very large, it is extremely slow.

A realistic use case: I have 32 lists with 2-3 elements each, from which I need to sample 1000 unique combinations. That could be 10 billion combinations, and it is slow to create all of them when I only need 1000.

I did consider just creating random samples and checking whether I had already created each one, but that would become slow when the number of samples approaches the number of possible combinations.

Image of data

import pandas as pd

df = pd.read_excel('Variables.xlsx',sheet_name="Variables" ,index_col=0)
df_out = pd.DataFrame(columns=df.index)

def for_recursive(number_of_loops, range_list, execute_function, current_index=0, iter_list=None):
    # Start with a fresh placeholder list on the outermost call
    if iter_list is None:
        iter_list = [0]*number_of_loops
    
    if current_index == number_of_loops-1:
        for iter_list[current_index] in range_list.iloc[current_index].dropna():
            execute_function(iter_list)
    else:
        for iter_list[current_index] in range_list.iloc[current_index].dropna():
            for_recursive(number_of_loops, iter_list = iter_list, range_list = range_list,  current_index = current_index+1, execute_function = execute_function) 
            
def do_whatever(index_list):
    df_out.loc[len(df_out)] = index_list
    
for_recursive(range_list = df, execute_function = do_whatever , number_of_loops=len(df))

df_out = df_out.sample(n=min(len(df_out),1000))

with pd.ExcelWriter("Variables.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
    df_out.to_excel(writer, sheet_name='Simulations', index=False)

Solution

  • Make use of the standard library. The product function in the itertools module can generate every possible combination of the values in the data.

    import pandas as pd
    
    data = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],  
                        'B': [1, 4, 6, 33, 0, 0, 0, 0, 0, 0],
                        'C': [2, 6, 8, 44, 1, 1, 1, 1, 1, 1],
                        'D': [3, 0, 0, 55, 0, 0, 0, 0, 0, 0],
                        })
    
    from itertools import product 
    
    full_collection = list(product(data['A'], data['B'], data['C'], data['D']))
    
    print(len(full_collection)) # 10000
    

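    The random.sample function draws without replacement, so no position in full_collection is picked twice. The sampled combinations are unique as long as the input lists themselves contain no repeated values.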

    import random
    
    samples = random.sample(full_collection, 1000)
    

    EDIT: A possible alternative solution

    Instead of creating the list of all possible combinations, generate random combinations from the unique values in each of the dataset columns. The generator expression keeps memory use low, but it does not guarantee that each sample will be unique.

    sample_size = 1000
    
    # Get the column names
    col_names = tuple(data.columns)
    
    # Create a dictionary of unique values in each column
    unique_values = dict()
    for col_name in col_names:
        unique_values[col_name] = tuple(data[col_name].unique())
    
    # Create a sample generator
    samples_gen = (tuple([random.choice(unique_values[col_name]) 
                          for col_name in col_names]) 
                   for _ in range(sample_size))
    
    # Iterate through the generated samples
    for sample in samples_gen:
        # Do something with the sample
        print(sample)
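
Because the generator above can repeat a combination, one possible way to enforce uniqueness is to collect the draws in a set until enough distinct ones have been found. The following is a minimal sketch, not part of the original answer: the function name `unique_random_combinations`, its parameters, and the example lists are hypothetical, and it assumes every value is hashable.

```python
import random


def unique_random_combinations(value_lists, sample_size, seed=None):
    """Draw `sample_size` distinct combinations, one element per input list."""
    rng = random.Random(seed)
    # Drop repeated values inside each list so the count below is exact
    pools = [tuple(dict.fromkeys(values)) for values in value_lists]

    # Number of distinct combinations that exist
    total = 1
    for pool in pools:
        total *= len(pool)
    if sample_size > total:
        raise ValueError("sample_size exceeds the number of distinct combinations")

    seen = set()
    while len(seen) < sample_size:
        # A repeated draw is absorbed by the set and simply retried
        seen.add(tuple(rng.choice(pool) for pool in pools))
    return list(seen)


# Hypothetical example data: 3 * 5 * 3 * 3 = 135 distinct combinations
value_lists = [["a", "b", "c"], [1, 4, 6, 33, 0, 0], [2, 6, 8], [3, 0, 55]]
samples = unique_random_combinations(value_lists, 50, seed=0)
```

Retries are rare when the sample size is small compared with the number of possible combinations (for example 1000 out of billions). The loop slows down as the sample size gets close to the total, which is the situation the question is worried about.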
    

    Using a closure function to create a simpler iterator:

    def sample_generator_from_dataframe(data, col_names=None):
        if col_names is None:
            col_names = tuple(data.columns)
        unique_values = dict()
        for col_name in col_names:
            unique_values[col_name] = tuple(data[col_name].unique())
    
        # An infinite sample generator
        def _generator():
            while True:
                yield tuple([random.choice(unique_values[col_name]) 
                            for col_name in col_names])
    
        # Call the generator function to get an iterator (iter() on a plain function raises TypeError)
        return _generator()
    
    
    # Initialise the generator with the dataframe content
    new_sample_gen = sample_generator_from_dataframe(data)
    
    # Iterate over generated samples
    for _ in range(sample_size):
        sample = next(new_sample_gen)
        # Do something with the generated sample
        print(sample)
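
The generators above still do not guarantee unique samples. A different technique, offered here only as a sketch and not as part of the original answer, is to number every combination from 0 to product(m_i) - 1, draw unique numbers with random.sample over a range, and decode each number into one element per list. The function name `sample_unique_combinations` and its parameters are hypothetical, and the sketch assumes the total number of combinations fits in a machine-sized integer, which holds for 32 lists of 3 elements (3**32).

```python
import math
import random


def sample_unique_combinations(value_lists, sample_size, seed=None):
    """Sample distinct combinations without building the full product."""
    rng = random.Random(seed)
    # Drop repeated values inside each list so distinct indices give distinct tuples
    pools = [tuple(dict.fromkeys(values)) for values in value_lists]
    sizes = [len(pool) for pool in pools]
    total = math.prod(sizes)

    # Never ask for more samples than there are combinations
    k = min(sample_size, total)
    # Sampling from a range picks k distinct integers without storing the range
    indices = rng.sample(range(total), k)

    combos = []
    for index in indices:
        combo = []
        # Decode the index as a mixed-radix number, last list first
        for pool, size in zip(reversed(pools), reversed(sizes)):
            index, position = divmod(index, size)
            combo.append(pool[position])
        combos.append(tuple(reversed(combo)))
    return combos


# Hypothetical data shaped like the question: 32 lists with 3 elements each
value_lists = [[0, 1, 2] for _ in range(32)]
samples = sample_unique_combinations(value_lists, 1000, seed=0)
```

Each index maps to exactly one combination, so the result is unique by construction and no retries are needed, even when the sample size equals the number of possible combinations.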