Generate all multiplicative (product) combinations of columns in a pandas dataframe


I would like to generate all 2-way (and possibly 3-way) "multiplicative" combinations (i.e., column1 x column2, column2 x column3, column1 x column3, column1 x column2 x column3, etc.) of the columns of a pandas dataframe with about 100 columns and 500K rows. I plan to evaluate these combinations to identify high-performing "feature interactions" in a predictive model.

Thus far, my attempts to generate the combinations (by repurposing existing Stack Overflow suggestions and other online materials) have not been successful. Here is a minimal example with a very simple dataframe and the code I am using:

import pandas as pd
from itertools import combinations

df = pd.DataFrame({'age': [10, 20], 'height': [5, 6], 'weight': [100, 150]})

sample_list = df.columns.to_list()
output = list(combinations(sample_list, 2))

pd.DataFrame(df[x].prod(axis=1) for x in output)

My code above raises KeyError: ('age', 'height'). The final output should contain only the 2-way (and potentially 3-way) "multiplicative" combinations (i.e., it should not include the original columns).

Can somebody please guide me to the solution? Also, how would one modify the code to generate all 3-way combinations? Any suggestions to optimize for limited RAM (~30GB) are also appreciated.


Solution

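  • The KeyError arises because df[x] receives a tuple, which pandas looks up as a single column label; passing a list selects several columns instead. Consider a chained multiplication with functools.reduce to build a list of Series, then join them column-wise with pd.concat: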

    from functools import reduce
    from itertools import combinations
    import pandas as pd
    
    df = pd.DataFrame({
        'age':[10, 20], 'height':[5, 6], 'weight':[100,150]
    })
    
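    def chain_mul(cols):
        # Name the new column after its factors, e.g. "age*height"
        col_name = "*".join(cols)
        # Select with a list (a tuple would be read as one column label)
        series_dict = df[list(cols)].to_dict('series')
        # Multiply the Series together element-wise, left to right
        col_prod = reduce(lambda x, y: x.mul(y), series_dict.values())
        return pd.Series(col_prod, name=col_name)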
    
    # BUILD COLUMN COMBINATIONS (DUOS AND TRIOS)
    sample_list = df.columns.to_list()
    combns = (
        list(combinations(sample_list, 2)) +
        list(combinations(sample_list, 3))
    )
    
    # BUILD LIST OF COLUMN PRODUCTS
    series_list = [chain_mul(cols) for cols in combns]
    
    # HORIZONTAL JOIN
    interactions_df = pd.concat(series_list, axis=1)
    
    print(interactions_df)
    #    age*height  age*weight  height*weight  age*height*weight
    # 0          50        1000            500               5000
    # 1         120        3000            900              18000
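
On the RAM question, which the code above does not address: one possible approach is to produce the products in batches, so that only one batch is in memory at a time. The following is a minimal sketch, not a tuned implementation. The function name interaction_batches and its parameters (order, batch_size, dtype) are made up for illustration. It assumes all columns are numeric, and that float32 precision is acceptable for screening interactions (large products may lose precision in float32).

```python
from itertools import combinations, islice
from math import comb

import pandas as pd


def interaction_batches(df, order=2, batch_size=1000, dtype="float32"):
    """Yield DataFrames holding at most `batch_size` product columns each.

    Hypothetical helper: `order` is the number of columns multiplied
    together (2 for pairs, 3 for trios).
    """
    values = df.to_numpy(dtype=dtype)      # cast once to a single 2-D array
    names = [str(c) for c in df.columns]
    idx_combos = combinations(range(len(names)), order)
    while True:
        # Take the next `batch_size` index tuples from the lazy iterator
        batch = list(islice(idx_combos, batch_size))
        if not batch:
            break
        # values[:, list(idx)] picks the columns of one combination;
        # prod(axis=1) multiplies them row by row.
        data = {
            "*".join(names[i] for i in idx): values[:, list(idx)].prod(axis=1)
            for idx in batch
        }
        yield pd.DataFrame(data, index=df.index)


df = pd.DataFrame({"age": [10, 20], "height": [5, 6], "weight": [100, 150]})

for batch_df in interaction_batches(df, order=2, batch_size=2):
    # Evaluate the batch here (e.g. score each column), then let it go
    print(batch_df.columns.to_list())

# Number of product columns for the 100-column case in the question
n_pairs, n_trios = comb(100, 2), comb(100, 3)
print(n_pairs, n_trios)
```

For the sizes in the question, there are 4,950 pairs and 161,700 trios. At 500K rows, a float32 column takes 2 MB, so all pairs come to roughly 9.9 GB, while all trios come to roughly 323 GB. The full set of trios therefore cannot be held in 30 GB at once; scoring each batch and keeping only the promising columns is one way around that.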