Search code examples
pythonpandasdataframecombinations

Summing columns of Pandas dataframe in a systematic way


I have a pandas dataframe which looks like this:

1_2 1_3 1_4 2_3 2_4 3_4
1   5   2   8   2   2
4   3   4   5   8   5
8   8   8   9   3   3
4   3   4   4   8   3
8   0   7   4   2   2

where the columns are the 4C2 combinations of 1,2,3,4. And I would like to generate 4 new columns f_1, f_2, f_3, f_4 where the values of the columns are defined to be

df['f_1'] = df['1_2']+df['1_3']+df['1_4']
df['f_2'] = df['1_2']+df['2_3']+df['2_4']
df['f_3'] = df['1_3']+df['2_3']+df['3_4']
df['f_4'] = df['1_4']+df['2_4']+df['3_4']

In other words, the column f_i are defined to be the sum of columns i_j and k_i.

So I can brute force my way in this case. However, my original dataframe is a lot bigger and there are 20C2 = 190 columns instead and hence a brute force method wouldn't work.

So the desired outcome looks like

1_2 1_3 1_4 2_3 2_4 3_4 f_1 f_2 f_3 f_4
1   5   2   8   2   2   8   11  15  6
4   3   4   5   8   5   11  17  13  17
8   8   8   9   3   3   24  20  20  14
4   3   4   4   8   3   11  16  10  15
8   0   7   4   2   2   15  14  6   11

Thank you so much.


Solution

  • Build a dictionary of the columns with str.split+explode+Index.groupby, and process them in a loop:

    s = df.columns.to_series().str.split('_').explode()
    d = s.index.groupby(s)
    
    for k, v in d.items():
        df[f'f_{k}'] = df[v].sum(axis=1)
    

    You could also use eval instead of the loop once you have d:

    query = '\n'.join(f'f_{k} = {"+".join(map("`{}`".format, v))}'
                      for k,v in d.items())
    out = df.eval(query)
    

    Output:

       1_2  1_3  1_4  2_3  2_4  3_4  f_1  f_2  f_3  f_4
    0    1    5    2    8    2    2    8   11   15    6
    1    4    3    4    5    8    5   11   17   13   17
    2    8    8    8    9    3    3   24   20   20   14
    3    4    3    4    4    8    3   11   16   10   15
    4    8    0    7    4    2    2   15   14    6   11
    

    Intermediate d:

    {'1': ['1_2', '1_3', '1_4'],
     '2': ['1_2', '2_3', '2_4'],
     '3': ['1_3', '2_3', '3_4'],
     '4': ['1_4', '2_4', '3_4'],
    }
    

    Pure python approach to build d:

    d = {}
    for c in df:
        for k in c.split('_'):
            d.setdefault(k, []).append(c)
    

    You could also imagine a pure pandas approach based on reshaping with melt+pivot_table, but this is most likely much less efficient:

    out = df.join(df
       .set_axis(df.columns.str.split('_'), axis=1)
       .melt(ignore_index=False).explode('variable')
       .reset_index()
       .pivot_table(index='index', columns='variable',
                    values='value', aggfunc='sum')
       .add_prefix('f_')
    )