I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ind0": list("QQQWWWW"), "ind1": list("RRRRSSS"), "vals": range(7), "cols": list("XXYXXYY")})
print(df)
Output:
ind0 ind1 vals cols
0 Q R 0 X
1 Q R 1 X
2 Q R 2 Y
3 W R 3 X
4 W S 4 X
5 W S 5 Y
6 W S 6 Y
I want to aggregate the values while creating columns from col
, so I thought of using pivot_table
:
df_res = df.pivot_table(index=["ind0", "ind1"], columns="cols", values="vals", aggfunc=np.sum).fillna(0)
print(df_res)
This gives me:
cols X Y
ind0 ind1
Q R 1.0 2.0
W R 3.0 0.0
S 4.0 11.0
However, I would rather get the sum independent of ind1
categories while keeping the information in this column. So, the desired output would be:
cols X Y
ind0 ind1
Q R 1.0 2.0
W R,S 7.0 11.0
Is there a way to use pivot_table
or pivot
to this end or do I have to aggregate for ind1
in a second step? If the latter, how?
You could reset_index
of df_res
and groupby
"ind0" and using agg
, use different functions on columns: join
ing unique values of "ind1" and sum
ming "X" and "Y".
out = df_res.reset_index().groupby('ind0').agg({'ind1': lambda x: ', '.join(x.unique()), 'X':'sum', 'Y':'sum'})
Or if you have multiple columns that you need to do the same function on, you could also use a dict comprehension:
funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k:'sum' for k in ('X','Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)
Output:
cols ind1 X Y
ind0
Q R 1.0 2.0
W R, S 7.0 11.0