I have a pandas dataframe which looks like this:
1_2 1_3 1_4 2_3 2_4 3_4
1 5 2 8 2 2
4 3 4 5 8 5
8 8 8 9 3 3
4 3 4 4 8 3
8 0 7 4 2 2
where the columns are the 4C2 combinations of 1,2,3,4. And I would like to generate 4 new columns f_1, f_2, f_3, f_4
where the values of the columns are defined to be
df['f_1'] = df['1_2']+df['1_3']+df['1_4']
df['f_2'] = df['1_2']+df['2_3']+df['2_4']
df['f_3'] = df['1_3']+df['2_3']+df['3_4']
df['f_4'] = df['1_4']+df['2_4']+df['3_4']
In other words, the column f_i
are defined to be the sum of columns i_j
and k_i
.
So I can brute force my way in this case. However, my original dataframe is a lot bigger and there are 20C2 = 190
columns instead and hence a brute force method wouldn't work.
So the desired outcome looks like
1_2 1_3 1_4 2_3 2_4 3_4 f_1 f_2 f_3 f_4
1 5 2 8 2 2 8 11 15 6
4 3 4 5 8 5 11 17 13 17
8 8 8 9 3 3 24 20 20 14
4 3 4 4 8 3 11 16 10 15
8 0 7 4 2 2 15 14 6 11
Thank you so much.
Build a dictionary of the columns with str.split
+explode
+Index.groupby
, and process them in a loop:
s = df.columns.to_series().str.split('_').explode()
d = s.index.groupby(s)
for k, v in d.items():
df[f'f_{k}'] = df[v].sum(axis=1)
You could also use eval
instead of the loop once you have d
:
query = '\n'.join(f'f_{k} = {"+".join(map("`{}`".format, v))}'
for k,v in d.items())
out = df.eval(query)
Output:
1_2 1_3 1_4 2_3 2_4 3_4 f_1 f_2 f_3 f_4
0 1 5 2 8 2 2 8 11 15 6
1 4 3 4 5 8 5 11 17 13 17
2 8 8 8 9 3 3 24 20 20 14
3 4 3 4 4 8 3 11 16 10 15
4 8 0 7 4 2 2 15 14 6 11
Intermediate d
:
{'1': ['1_2', '1_3', '1_4'],
'2': ['1_2', '2_3', '2_4'],
'3': ['1_3', '2_3', '3_4'],
'4': ['1_4', '2_4', '3_4'],
}
Pure python approach to build d
:
d = {}
for c in df:
for k in c.split('_'):
d.setdefault(k, []).append(c)
You could also imagine a pure pandas approach based on reshaping with melt
+pivot_table
, but this is most likely much less efficient:
out = df.join(df
.set_axis(df.columns.str.split('_'), axis=1)
.melt(ignore_index=False).explode('variable')
.reset_index()
.pivot_table(index='index', columns='variable',
values='value', aggfunc='sum')
.add_prefix('f_')
)