Tags: pandas, dataframe, group-by, multi-index, pandas-apply

Index compatibility of a DataFrame with the MultiIndex result of apply on groups


We have to apply an algorithm to columns in a DataFrame; the data has to be grouped by a key, and the result should form a new column in the DataFrame. Since this is a common use case, we wonder whether we have chosen a sound approach.

The following code reflects our approach in simplified form.

import numpy as np
import pandas as pd

np.random.seed(42)

N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)

df = pd.DataFrame(data)

This generates a DataFrame as follows.

    key         x
0     0  0.969585
1     1  0.775133
2     1  0.939499
3     1  0.894827
4     1  0.597900
..  ...       ...
95   53  0.036887
96   54  0.609564
97   55  0.502679
98   56  0.051479
99   56  0.278646

We then apply example functions to the DataFrame groups.

def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const) # perform some pandas incompatible magic
    return pd.Series(y, index=index) # reconstruct index

g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))

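Assigning the result as a new column with df['y'] = y_per_g raises a TypeError, because y_per_g carries a MultiIndex (the key plus the original row label) while df has a single-level index.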

TypeError: incompatible index of inserted column with frame index

Thus a compatible MultiIndex needs to be introduced on df first.

df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)

This yields the intended result.

       key         x    y
index                    
0        0  0.969585  6.9
1        1  0.775133  6.0
2        1  0.939499  6.1
3        1  0.894827  6.4
4        1  0.597900  6.6
...    ...       ...  ...
95      53  0.036887  6.0
96      54  0.609564  6.0
97      55  0.502679  6.5
98      56  0.051479  6.0
99      56  0.278646  6.1

Now we wonder whether there is a more straightforward way of dealing with the index, and whether our approach is sound in general.


Solution

  • Use Series.droplevel to remove the first level of the MultiIndex so that the result has the same index as df; the assignment then works:

    g = df.groupby('key')
    df['y']  = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
    print(df)
    
        key         x    y
    0     0  0.969585  6.9
    1     1  0.775133  6.0
    2     1  0.939499  6.1
    3     1  0.894827  6.4
    4     1  0.597900  6.6
    ..  ...       ...  ...
    95   53  0.036887  6.0
    96   54  0.609564  6.0
    97   55  0.502679  6.5
    98   56  0.051479  6.0
    99   56  0.278646  6.1
    
    [100 rows x 3 columns]