python pandas python-itertools

How to find all combinations of values between pairs of columns in a DataFrame with multiple columns?


I have the following dataframe...

df1:
playerA   playerB  PlayerC PlayerD
kim         lee      b      f
jackson     kim      d      g
dan         lee      a      d

I want to generate a new dataframe that pairs every value in one column with every value in each of the other columns, for all pairs of columns. For example,

df_new:
Target   Source  
kim         lee
kim         kim
kim         lee
kim          b     
kim          d
kim          a
kim          f
kim          g
kim          d      
jackson      lee
jackson      kim
jackson      lee
jackson      b
.
.
.
.
lee         kim
lee         jackson
lee          dan
lee          b
lee          d
.
.
.

So I tried this code:

import itertools
def comb(df1):
    return [df1.loc[:, list(x)].set_axis(['Target','Source'], axis=1)
            for x in itertools.combinations(df1.columns, 2)]

However, it only pairs values that sit in the same row.

Is there any way to generate all possible combinations between columns? Thanks in advance!


Solution

  • One way uses itertools, via permutations, product and chain.from_iterable:

    from itertools import chain, permutations, product

    import pandas as pd

    # Ordered column pairs -> product of their values -> one flat stream of rows.
    df = pd.DataFrame(
        chain.from_iterable(product(df1[col_1], df1[col_2])
                            for col_1, col_2 in permutations(df1.columns, r=2)),
        columns=["Target", "Source"]
    )
    

    Here we first take all 2-permutations (ordered pairs) of the columns, then for each pair form the Cartesian product of the two columns' values. Finally, chain.from_iterable flattens the per-pair products into a single stream of rows, which is passed to the DataFrame constructor.

    I get a 108 x 2 dataframe:

          Target Source
    0        kim    lee
    1        kim    kim
    2        kim    lee
    3    jackson    lee
    4    jackson    kim
    ..       ...    ...
    103        g      d
    104        g      a
    105        d      b
    106        d      d
    107        d      a
    

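    (where 108 = 3*9*4: each of the 4 columns supplies its 3 values as Target, and each of those values is paired with the 9 values in the other 3 columns. Equivalently, 12 ordered column pairs * 9 value pairs each.)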
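
As a rough sketch, the same steps can be spelled out one at a time on the sample data from the question, which also makes the row count easy to check. The intermediate names `column_pairs`, `blocks` and `rows` are only illustrative and are not part of the original answer.

```python
from itertools import chain, permutations, product

import pandas as pd

# Sample frame copied from the question.
df1 = pd.DataFrame({
    "playerA": ["kim", "jackson", "dan"],
    "playerB": ["lee", "kim", "lee"],
    "PlayerC": ["b", "d", "a"],
    "PlayerD": ["f", "g", "d"],
})

# Step 1: every ordered pair of distinct columns.
column_pairs = list(permutations(df1.columns, r=2))

# Step 2: for each column pair, the Cartesian product of the two columns' values.
blocks = [list(product(df1[a], df1[b])) for a, b in column_pairs]

# Step 3: flatten the per-pair blocks into one list of (Target, Source) rows.
rows = list(chain.from_iterable(blocks))
df_new = pd.DataFrame(rows, columns=["Target", "Source"])

n_rows, n_cols = df1.shape
print(len(column_pairs))  # number of ordered column pairs
print(df_new.shape)       # shape of the combined frame
```

Because `permutations` yields ordered pairs, both (playerA, playerB) and (playerB, playerA) appear, which is what gives rows such as `lee  kim` in the desired output. If only unordered column pairs were wanted, `itertools.combinations` could be swapped in at step 1, which would halve the row count.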