Tags: python, pandas, python-itertools

How to find the combinations of two columns in a DataFrame when there are multiple columns in Python?


I have a data frame like this:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,3,4,4,4], 'b':[2,3,4,4,5,5,5], 'c':[5,6,7,8,5,5,5]})

I want to generate a new data frame from the combinations of two columns. For example, I want to find the combinations (a, b), (a, c), (a, d), (b, c), (b, d), (c, d).

I tried this code:

col_a_list = df['a'].tolist()
col_b_list = df['b'].tolist()
col_c_list = df['c'].tolist()

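import itertools

def cartesian_product(lists):
    # itertools.product takes one element from every list, so each tuple has len(lists) items
    return list(itertools.product(*lists))

# Only columns a, b and c exist in df above
ls = [col_a_list, col_b_list, col_c_list]
print("Original Lists:", ls)
print("Cartesian product of the said lists:", cartesian_product(ls))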

However, it does not do what I want: it generates combinations across all three columns at once, not pairs of two.

I want my output to look like this:

column x  column y
1          2
1          5
1          3
1          6
1          4
1          7
...

Is there any way that I could generate these columns? Thanks in advance!


Solution

  • I believe you need:

    import itertools
    import pandas as pd
    
    df = pd.DataFrame({'a':[1,2,3,3,4,4,4], 'b':[2,3,4,4,5,5,5], 
                       'c':[5,6,7,8,5,5,5], 'd':[5,6,7,8,5,5,5]})
    
    def comb(df):
        # One two-column frame per pair of columns, renamed to a common header
        return [df.loc[:, list(x)].set_axis(['column x','column y'], axis=1) 
                for x in itertools.combinations(df.columns, 2)]
    
    # A stable sort on the original row index keeps each row's pairs together
    df1 = pd.concat(comb(df)).sort_index(kind='mergesort').reset_index(drop=True)
    print(df1.head(10))
       column x  column y
    0         1         2
    1         1         5
    2         1         5
    3         2         5
    4         2         5
    5         5         5
    6         2         3
    7         2         6
    8         2         6
    9         3         6
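
The code above pairs values row by row. If what you need is instead every value of one column against every value of the other (closer to what `itertools.product` did in the question), a rough sketch might look like the following. This is only one possible reading of the desired output. `pairwise_product` and the extra `pair` column are names made up for this illustration, and dropping repeated values with `unique()` is an assumption, not something the question asks for.

```python
import itertools
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 3, 4, 4, 4], 'b': [2, 3, 4, 4, 5, 5, 5],
                   'c': [5, 6, 7, 8, 5, 5, 5], 'd': [5, 6, 7, 8, 5, 5, 5]})

def pairwise_product(df):
    # Hypothetical helper: for each pair of columns, build the Cartesian
    # product of their distinct values (assumption: duplicates are unwanted).
    frames = []
    for x, y in itertools.combinations(df.columns, 2):
        pairs = itertools.product(df[x].unique(), df[y].unique())
        frame = pd.DataFrame(list(pairs), columns=['column x', 'column y'])
        # Record which two source columns the rows came from
        frame.insert(0, 'pair', f'{x},{y}')
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

df2 = pairwise_product(df)
print(df2.head())
```

Remove the `unique()` calls if repeated values should be kept. The result then grows to `len(df) ** 2` rows per column pair, which gets large quickly.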