python pandas dataframe numpy calculated-columns

Difference between subsequent values in rows with same code: value error

I have a DataFrame with 3 columns: POD (which is a code), timestamp, EAI_ALL (number). I want to calculate a 4th column where every row has the following value: the value of EAI_ALL of the current row minus the value of EAI_ALL on the previous row. This has to be done for every code (e.g. if the current code is 2 and the code in the previous row is 1, the calculated value has to be 0 because the code is different). I managed to do this operation with the following code:

#group the DF by POD code
grouped = df.groupby('POD')


#Define Lambda function
eai_diff = lambda x: x.EAI_ALL - x.EAI_ALL.shift(+1)

df['EAI_Delta'] = grouped.apply(eai_diff).reset_index(0, drop=True).fillna(0)

This works fine, unless I have only one POD code in the DataFrame. I keep getting this error if I apply the function to a DataFrame with only one POD code.

ValueError: Wrong number of items passed 48, placement implies 1

As you can see df has 48 rows, but with the group by it gets reduced to one row. I need the groupby because if I have more than one POD code the operation basically has to be done for one POD code at a time.

Does anyone have any suggestion? Thanks in advance!

Solution

This is an error that comes back regularly, which comes from the fact that returning a series from a dataframe apply with a single group returns a dataframe with a single row, and the index as columns:

>>> df
   POD  EAI_ALL
0  foo        0
1  foo        1
2  foo        2
3  foo        3
4  foo        4
5  foo        5
6  foo        6
7  foo        7
8  foo        8
>>> df.groupby('POD').apply(lambda x: x.EAI_ALL - x.EAI_ALL.shift(+1))
EAI_ALL   0    1    2    3    4    5    6    7    8
POD                                                
foo     NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

To avoid that, and if you really need .apply, you can return a dataframe instead:

>>> df.groupby('POD').apply(lambda x: x[['EAI_ALL']] - x[['EAI_ALL']].shift(+1))
   EAI_ALL
0      NaN
1      1.0
2      1.0
3      1.0
4      1.0
5      1.0
6      1.0
7      1.0
8      1.0

However the better way is to try and stay away from .apply. For your use case, groupby.shift() works perfectly and only shifts within the group as you expect:

>>> df.groupby('POD')['EAI_ALL'].shift()
0  NaN
1    0
2    1
3    2
4    3
5    4
6    5
>>> df['EAI_Delta'] = (df['EAI_ALL'] - df.groupby('POD')['EAI_ALL'].shift()).fillna(0)
>>> df
   POD  EAI_ALL  EAI_Delta
0  foo        0        0.0
1  foo        1        1.0
2  foo        2        1.0
3  foo        3        1.0
4  foo        4        1.0
5  foo        5        1.0
6  foo        6        1.0
7  foo        7        1.0
8  foo        8        1.0

And an example with several POD values:

>>> df2['EAI_Delta'] = (df2['EAI_ALL'] - df2.groupby('POD')['EAI_ALL'].shift()).fillna(0)
>>> df2
   POD  EAI_ALL  EAI_Delta
0  foo        0        0.0
1  foo        1        1.0
2  foo        2        1.0
3  foo        3        1.0
4  bar        4        0.0
5  bar        5        1.0
6  bar        6        1.0
7  bar        7        1.0
8  bar        8        1.0

Even better as suggested by @mandiatodos in comments, you can directly use .groupby().diff()