Search code examples
pythonpandasdataframedata-analysis

Create column from all possible combination of two columns in dataframe based on groupby in Python


I have a dataframe in like below

id group log
10 UU1Q 23
10 UU1Q 12
10 UU2Q 15
11 UU2Q 17
11 UU3Q 35.6
11 UU1Q 29.8
11 UU1Q 33
11 UU1Q 44
13 UU2Q 17.77
13 UU2Q 19.90
13 UU2Q 55
14 UU3Q 33
15 UU3Q 22

For Each ID and group I want to create all possible combination of log values present in the dataframe in a new column. Desired Output

id group log new_col
10 UU1Q 23 (23,23)
10 UU1Q 12 (23,12)
10 UU2Q 15 (15,15)
11 UU2Q 17 (17,17)
11 UU3Q 35.6 (35.6,35.6)
11 UU1Q 29.8 (29.8, 29.8)
11 UU1Q 33 (29.8,33)
11 UU1Q 44 (29.8,44)
11 UU1Q (33,44)
13 UU2Q 17.77 (17.77,17.77)
13 UU2Q 19.90 (17.77,19.90)
13 UU2Q 55 (17.77,55)
13 UU2Q (19.90,55)
14 UU3Q 33 (33,33)
15 UU3Q 22 (22,22)

I used shift function but it is only generating combinations with next matching cell. I want to get all the possible combination in each group.
dummy['new'] = dummy.groupby(['ID', 'group'])['log'].shift()


Solution

  • This is close what need - added all combinations and if one element per group is created tuple with same values:

    from  itertools import  combinations
    
    df = (df.groupby(['id','group'])['log']
            .apply(lambda x: list(combinations(x, 2)) if len(x) > 1 else [(*x, *x)])
            .explode()
            .reset_index(name='comb'))
    print (df)
        id group           comb
    0   10  UU1Q   (23.0, 12.0)
    1   10  UU2Q   (15.0, 15.0)
    2   11  UU1Q   (29.8, 33.0)
    3   11  UU1Q   (29.8, 44.0)
    4   11  UU1Q   (33.0, 44.0)
    5   11  UU2Q   (17.0, 17.0)
    6   11  UU3Q   (35.6, 35.6)
    7   13  UU2Q  (17.77, 19.9)
    8   13  UU2Q  (17.77, 55.0)
    9   13  UU2Q   (19.9, 55.0)
    10  14  UU3Q   (33.0, 33.0)
    11  15  UU3Q   (22.0, 22.0)
    

    Or is possible create same values tuples of first rows per ['id','group'] and join to DataFrame df1 filled by combinations:

    from  itertools import  combinations
    
    df1 = (df.groupby(['id','group'])['log']
            .apply(lambda x: list(combinations(x, 2)))
            .explode()
            .dropna()
            .reset_index(name='comb'))
    
    df2 = df.groupby(['id','group']).head(1).copy()
    df2['comb'] = df2.pop('log').map(lambda x: (x,x))
    
    df = pd.concat([df2, df1]).sort_values(['id','group'], ignore_index=True)
    print (df)
        id group            comb
    0   10  UU1Q    (23.0, 23.0)
    1   10  UU1Q    (23.0, 12.0)
    2   10  UU2Q    (15.0, 15.0)
    3   11  UU1Q    (29.8, 29.8)
    4   11  UU1Q    (29.8, 33.0)
    5   11  UU1Q    (29.8, 44.0)
    6   11  UU1Q    (33.0, 44.0)
    7   11  UU2Q    (17.0, 17.0)
    8   11  UU3Q    (35.6, 35.6)
    9   13  UU2Q  (17.77, 17.77)
    10  13  UU2Q   (17.77, 19.9)
    11  13  UU2Q   (17.77, 55.0)
    12  13  UU2Q    (19.9, 55.0)
    13  14  UU3Q    (33.0, 33.0)
    14  15  UU3Q    (22.0, 22.0)