python python-3.x loops combinatorics python-itertools

Dynamically create all column combinations in a pandas data frame


I have a data frame df whose columns hold string values. My goal is to create a data frame final_df whose columns represent every possible combination of df's columns, with both the column names and the values concatenated (ideally separated by a _, which the sample code below does not do).

Example Code:

import pandas as pd
from itertools import combinations

d = {'AAA': ["xzy", "gze"], 'BBB': ["abc", "hja"], 'CCC': ["dfg", "hza"], 'DDD': ["hij", "klm"], 'EEE': ["lal", "opa"]}
df = pd.DataFrame(data=d)

# combinations of two columns
cc = list(combinations(df.columns,2))
df_2 = pd.concat([df[c[0]] + df[c[1]] for c in cc], axis=1, keys=cc)
df_2.columns = df_2.columns.map(''.join)

# combinations of three columns
del cc
cc = list(combinations(df.columns,3))
df_3 = pd.concat([df[c[0]] + df[c[1]] + df[c[2]] for c in cc], axis=1, keys=cc)
df_3.columns = df_3.columns.map(''.join)

# combinations of four columns
del cc
cc = list(combinations(df.columns,4))
df_4 = pd.concat([df[c[0]] + df[c[1]] + df[c[2]] + df[c[3]] for c in cc], axis=1, keys=cc)
df_4.columns = df_4.columns.map(''.join)

# combinations of five columns
del cc
cc = list(combinations(df.columns,5))
df_5 = pd.concat([df[c[0]] + df[c[1]] + df[c[2]] + df[c[3]] + df[c[4]] for c in cc], axis=1, keys=cc)
df_5.columns = df_5.columns.map(''.join)

# join dataframes
dfs = [df, df_2, df_3, df_4, df_5]
final_df = dfs[0].join(dfs[1:])

Is there a Pythonic way to dynamically create such a final_df data frame, depending on the number of columns?


Solution

  • Here is one solution, although it does not carry the column names over: the combined columns end up with default integer labels.

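    def combodf(dfx, x):
        # Join the values of every x-column combination row by row with "_".
        d = (['_'.join(i) for i in zip(*a)] for a in combinations(dfx.T.values.tolist(), x))
        return pd.DataFrame(d).T

    # Cover every combination size from 2 up to the number of columns.
    final_df = pd.concat([df, *(combodf(df, i) for i in range(2, len(df.columns) + 1))], axis=1)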
    

    Since your column names follow the same pattern as the values, it makes sense to treat them as values too. Here is a workaround that appends the column names as an extra last row, combines everything, and then turns that row back into the header.

    import pandas as pd
    from itertools import combinations
    
    def combodf(dfx, x):
        # Use the dfx argument (not the global df) so the function works on any frame.
        d = [['_'.join(i) for i in zip(*a)] for a in combinations(dfx.T.values.tolist(), x)]
        return pd.DataFrame(d).T
    
    d = {
        'AAA': ["xzy", "gze"],
        'BBB': ["abc", "hja"],
        'CCC': ["dfg", "hza"],
        'DDD': ["hij", "klm"],
        'EEE': ["lal", "opa"],
    }
    
    df = pd.DataFrame(data=d)
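    df.loc[len(df)] = df.columns  # append the column names as the last row
    # Pass axis by keyword (the positional form is rejected by recent pandas versions)
    # and derive the combination sizes from the number of columns.
    df = pd.concat([df, *(combodf(df, i) for i in range(2, len(df.columns) + 1))], axis=1)
    df.columns = df.tail(1).values[0]  # promote the last row to column names
    df = df.drop(df.index[-1])  # drop the helper row again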
    

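    Comparison (assuming final_df was built with the question's code, using "_" as the separator for both the values and the column names):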

    print((df == final_df).all().all()) # True
    print((df.columns == final_df.columns).all()) # True
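
If you would rather not route the column names through a helper row, the names and the values can be joined in the same loop. The following is only a minimal sketch of that alternative, not the approach above: the helper name all_column_combinations and its sep parameter are made up for illustration, and it assumes unique column names and string-only values.

```python
from itertools import combinations

import pandas as pd


def all_column_combinations(frame, sep="_"):
    """Hypothetical helper: one output column per non-empty subset of columns."""
    pieces = {}
    for r in range(1, len(frame.columns) + 1):
        for cols in combinations(frame.columns, r):
            # Join the chosen columns row by row, e.g. "xzy" and "abc" -> "xzy_abc".
            rows = zip(*(frame[c] for c in cols))
            pieces[sep.join(cols)] = [sep.join(row) for row in rows]
    # Dicts keep insertion order, so the columns follow the combination order.
    return pd.DataFrame(pieces, index=frame.index)


# Smaller three-column sample to keep the printed output short.
d = {'AAA': ["xzy", "gze"], 'BBB': ["abc", "hja"], 'CCC': ["dfg", "hza"]}
result = all_column_combinations(pd.DataFrame(data=d))
print(result.columns.tolist())
```

For n input columns this produces 2**n - 1 output columns (31 for the five-column example in the question), so it is only practical for a small number of columns.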