Search code examples
python-3.xpandasdataframepandas-groupbypandas-apply

Reducing dimensionality of multiindex pandas dataframe using apply


I have the following dataframe:

df = pd.DataFrame({('psl', 't1'): {'fiat': 36.389809173765507,
  'mazda': 18.139242981049016,
  'opel': 0.97626485600703961,
  'toyota': 74.464422292108878},
 ('psl', 't2'): {'fiat': 35.423004380643462,
  'mazda': 24.269803148695079,
  'opel': 1.0170540474994665,
  'toyota': 60.389948228586832},
 ('psv', 't1'): {'fiat': 35.836800462163097,
  'mazda': 15.893295606055901,
  'opel': 0.78744853046848606,
  'toyota': 74.054850828062271},
 ('psv', 't2'): {'fiat': 34.379812557124815,
  'mazda': 23.202587247335682,
  'opel': 0.80191294532382451,
  'toyota': 58.735083244244322}})

It looks like this: enter image description here

I wish to reduce it from a multiindex to a normal index. I wish to do this by applying a function using t1 and t2 values and returning only a single value which will result in there being two columns: psl and psv.

I have succeeded in grouping it as such and applying a function:

df.groupby(level=0, axis=1).agg(np.mean) 

which is very close to what I want except that I don't want to apply np.mean, but rather a custom function. In particular, a percent change function.

My end goal is to be able to do something like this:

df.groupby(level=0, axis=1).apply(lambda t1, t2: (t2-t1)/t1)

Which returns this error:

TypeError: <lambda>() missing 1 required positional argument: 't2'

I have also tried this:

df.apply(lambda x: x[x.name].apply(lambda x: x['t1']/x['t2']))

which in turn returns:

KeyError: (('psl', 't1'), 'occurred at index (psl, t1)')

Could you please include a thorough explanation of each part of your answer to the best of your abilities so I can better understand how pandas works.


Solution

  • Not easy. Use custom function with squeeze for Series and xs for select MultiIndex in columns:

    def f(x):
        t2 = x.xs('t2', axis=1, level=1)
        t1 = x.xs('t1', axis=1, level=1)
        a = (t2-t1)/t1
        #print (a)
        return (a.squeeze())
    
    df1 = df.groupby(level=0, axis=1).agg(f) 
    print (df1)
                 psl       psv
    fiat   -0.026568 -0.040656
    mazda   0.337972  0.459898
    opel    0.041781  0.018369
    toyota -0.189009 -0.206871
    

    Use lambda function is possible, but really awfull with repeating code:

    df1 = df.groupby(level=0, axis=1)
            .agg(lambda x: ((x.xs('t2', axis=1, level=1)-x.xs('t1', axis=1, level=1))/
                                     x.xs('t1', axis=1, level=1)).squeeze())