python pandas apply pandas-groupby multi-index

Recover a standard, single-index data frame after using pandas groupby+apply


I want to apply a custom reduction function to each group in a pandas data frame. The function reduces the group to a single row by performing operations that combine several of the group's columns.

I've implemented this like so:

import pandas as pd
import numpy as np

df = pd.DataFrame(data={
  "afac": np.random.random(size=1000),
  "bfac": np.random.random(size=1000),
  "class":np.random.randint(low=0,high=5,size=1000)
})

def f(group):
  total_area = group['afac'].sum()
  per_area   = (group['afac']/total_area).values
  per_pop    = group['bfac'].values
  return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop)]})

aggdf = df.groupby('class').apply(f)

My input data frame df looks like:

>>> df
         afac      bfac  class
0    0.689969  0.992403      0
1    0.688756  0.728763      1
2    0.086045  0.499061      1
3    0.078453  0.198435      2
4    0.621589  0.812233      4

But my code gives this multi-indexed data frame:

>>> aggdf
         per_apop
class            
0     0  0.553292
1     0  0.503112
2     0  0.444281
3     0  0.517646
4     0  0.503290

I've tried various ways of getting back to a "normal" data frame, but none seem to work.

>>> aggdf.reset_index()
   class  level_1  per_apop
0      0        0  0.553292
1      1        0  0.503112
2      2        0  0.444281
3      3        0  0.517646
4      4        0  0.503290

>>> aggdf.unstack().reset_index()
  class  per_apop
                0
0     0  0.553292
1     1  0.503112
2     2  0.444281
3     3  0.517646
4     4  0.503290

How can I perform this operation and get a normal data frame afterwards?

Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows. Perhaps using

return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})

Solution

  • reset_index lets you choose which index levels to reset and whether to keep them as columns. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. By default reset_index resets the entire index, but the level argument restricts it to just the levels you want. In the following example, only the last level (-1) is pulled out of the index, and drop=True discards it rather than adding it to the data frame as a column.

    aggdf.reset_index(level=-1, drop=True)
    
           per_apop
    class
    0      0.476184
    1      0.476254
    2      0.509735
    3      0.502444
    4      0.525287
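
For anyone who wants to reproduce this end to end, here is a minimal sketch. The fixed seed and the explicit column selection `[["afac", "bfac"]]` are my own assumptions, not part of the question: the seed only makes the run repeatable, and the selection keeps the grouping column out of what `f` receives, which avoids a deprecation warning in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "afac": rng.random(1000),
    "bfac": rng.random(1000),
    "class": rng.integers(low=0, high=5, size=1000),
})

def f(group):
    # Weight each row by its share of the group's total 'afac'.
    weights = (group["afac"] / group["afac"].sum()).values
    return pd.DataFrame({"per_apop": [np.sum(weights * group["bfac"].values)]})

# Assumption: select the value columns explicitly so f never sees 'class'.
aggdf = df.groupby("class")[["afac", "bfac"]].apply(f)
print(list(aggdf.index.names))   # two levels: 'class' and an unnamed one

# Drop only the last (unnamed) level; 'class' stays as the index.
flat = aggdf.reset_index(level=-1, drop=True)
print(list(flat.index.names))
```

The first print shows the extra unnamed level that comes from the index of the one-row frame returned by `f`; after the reset, only `class` is left.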
    

    EDIT:

    To move the class level of the index back into the data frame as a column, you can simply call .reset_index() again. Ugly, but it works.

    aggdf.reset_index(level=-1, drop=True).reset_index()
    
       class  per_apop
    0      0  0.515733
    1      1  0.497349
    2      2  0.527063
    3      3  0.515476
    4      4  0.494530
    

    Alternatively, you could reset the whole index and then just drop the extra column.

    aggdf.reset_index().drop('level_1', axis=1)
    
    
       class  per_apop
    0      0  0.515733
    1      1  0.497349
    2      2  0.527063
    3      3  0.515476
    4      4  0.494530