I want to apply a custom reduction function to each group in a pandas DataFrame. The function reduces the group to a single row by performing operations that combine several of the group's columns.
I've implemented this like so:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={
    "afac": np.random.random(size=1000),
    "bfac": np.random.random(size=1000),
    "class": np.random.randint(low=0, high=5, size=1000)
})

def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    return pd.DataFrame(data={'per_apop': [np.sum(per_area * per_pop)]})

aggdf = df.groupby('class').apply(f)
My input data frame df looks like:
>>> df
afac bfac class
0 0.689969 0.992403 0
1 0.688756 0.728763 1
2 0.086045 0.499061 1
3 0.078453 0.198435 2
4 0.621589 0.812233 4
But my code gives this multi-indexed data frame:
>>> aggdf
per_apop
class
0 0 0.553292
1 0 0.503112
2 0 0.444281
3 0 0.517646
4 0 0.503290
I've tried various ways of getting back to a "normal" data frame, but none seem to work.
>>> aggdf.reset_index()
class level_1 per_apop
0 0 0 0.553292
1 1 0 0.503112
2 2 0 0.444281
3 3 0 0.517646
4 4 0 0.503290
>>> aggdf.unstack().reset_index()
class per_apop
0
0 0 0.553292
1 1 0.503112
2 2 0.444281
3 3 0.517646
4 4 0.503290
How can I perform this operation and get a normal data frame afterwards?
Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows, for example:
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})
reset_index lets you choose which index levels to reset and whether to keep them as columns. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. By default reset_index resets the entire index, but the level argument restricts it to just the levels you want. In the following example, the last level (-1) is pulled out of the index, and drop=True discards it rather than inserting it as a column in the data frame.
aggdf.reset_index(level=-1, drop=True)
per_apop
class
0 0.476184
1 0.476254
2 0.509735
3 0.502444
4 0.525287
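If you want to check which levels the index has before and after the call, here is a minimal sketch. It assumes a small hand-written data frame (made-up values standing in for the random data in the question) and the same f as above:

```python
import pandas as pd
import numpy as np

# Small deterministic stand-in for the random data in the question
# (these values are made up for illustration).
df = pd.DataFrame({
    "afac": [1.0, 3.0, 2.0, 2.0],
    "bfac": [0.5, 0.1, 0.4, 0.8],
    "class": [0, 0, 1, 1],
})

def f(group):
    total_area = group["afac"].sum()
    per_area = (group["afac"] / total_area).values
    per_pop = group["bfac"].values
    return pd.DataFrame({"per_apop": [np.sum(per_area * per_pop)]})

aggdf = df.groupby("class").apply(f)

# Two levels: the group key and the unnamed index of the frame returned by f.
levels_before = list(aggdf.index.names)

# Drop only the last (unnamed) level; 'class' stays as the index.
flat = aggdf.reset_index(level=-1, drop=True)
levels_after = list(flat.index.names)

print(levels_before)
print(levels_after)
```

Depending on your pandas version, groupby(...).apply may emit a deprecation warning about operating on the grouping column; f only reads afac and bfac, so the result should not be affected.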
To push the class level of the index back into the data frame as a column, simply call .reset_index() again. Ugly, but it works.
aggdf.reset_index(level=-1, drop=True).reset_index()
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Alternatively, you could reset the whole index and then drop the extra column.
aggdf.reset_index().drop('level_1', axis=1)
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
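The same chain should also cover the case from your update, where f returns several columns and several rows per group. A rough sketch, again on a small made-up data frame, with f2 as a hypothetical variant of your function that reuses the return statement from the update:

```python
import pandas as pd
import numpy as np

# Made-up data for illustration; two groups of two rows each.
df = pd.DataFrame({
    "afac": [1.0, 3.0, 2.0, 2.0],
    "bfac": [0.5, 0.1, 0.4, 0.8],
    "class": [0, 0, 1, 1],
})

def f2(group):
    # Hypothetical variant of f: two rows and two columns per group.
    total_area = group["afac"].sum()
    per_area = (group["afac"] / total_area).values
    per_pop = group["bfac"].values
    return pd.DataFrame({
        "per_apop": [np.sum(per_area * per_pop), 2],
        "sue": [1, 3],
    })

result = (
    df.groupby("class")
      .apply(f2)
      .reset_index(level=-1, drop=True)  # discard the unnamed inner level
      .reset_index()                     # turn 'class' into a regular column
)
print(result)
```

Each group contributes two rows here, so class is simply repeated for every row that f2 returns.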