Search code examples
pythonpandasdataframenumpycalculated-columns

Difference between subsequent values in rows with same code: value error


I have a DataFrame with 3 columns: POD (which is a code), timestamp, EAI_ALL (number). I want to calculate a 4th column where every row has the following value: the value of EAI_ALL of the current row minus the value of EAI_ALL on the previous row. This has to be done for every code (e.g. if the current code is 2 and the code in the previous row is 1, the calculated value has to be 0 because the code is different). I managed to do this operation with the following code:

#group the DF by POD code
grouped = df.groupby('POD')


#Define Lambda function
eai_diff = lambda x: x.EAI_ALL - x.EAI_ALL.shift(+1)

df['EAI_Delta'] = grouped.apply(eai_diff).reset_index(0, drop=True).fillna(0)

This works fine, unless I have only one POD code in the DataFrame. I keep getting this error if I apply the function to a DataFrame with only one POD code.

ValueError: Wrong number of items passed 48, placement implies 1

As you can see df has 48 rows, but with the group by it gets reduced to one row. I need the groupby because if I have more than one POD code the operation basically has to be done for one POD code at a time.

Does anyone have any suggestion? Thanks in advance!


Solution

  • This is an error that comes back regularly, which comes from the fact that returning a series from a dataframe apply with a single group returns a dataframe with a single row, and the index as columns:

    >>> df
       POD  EAI_ALL
    0  foo        0
    1  foo        1
    2  foo        2
    3  foo        3
    4  foo        4
    5  foo        5
    6  foo        6
    7  foo        7
    8  foo        8
    >>> df.groupby('POD').apply(lambda x: x.EAI_ALL - x.EAI_ALL.shift(+1))
    EAI_ALL   0    1    2    3    4    5    6    7    8
    POD                                                
    foo     NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
    

    To avoid that, and if you really need .apply, you can return a dataframe instead:

    >>> df.groupby('POD').apply(lambda x: x[['EAI_ALL']] - x[['EAI_ALL']].shift(+1))
       EAI_ALL
    0      NaN
    1      1.0
    2      1.0
    3      1.0
    4      1.0
    5      1.0
    6      1.0
    7      1.0
    8      1.0
    

    However the better way is to try and stay away from .apply. For your use case, groupby.shift() works perfectly and only shifts within the group as you expect:

    >>> df.groupby('POD')['EAI_ALL'].shift()
    0  NaN
    1    0
    2    1
    3    2
    4    3
    5    4
    6    5
    >>> df['EAI_Delta'] = (df['EAI_ALL'] - df.groupby('POD')['EAI_ALL'].shift()).fillna(0)
    >>> df
       POD  EAI_ALL  EAI_Delta
    0  foo        0        0.0
    1  foo        1        1.0
    2  foo        2        1.0
    3  foo        3        1.0
    4  foo        4        1.0
    5  foo        5        1.0
    6  foo        6        1.0
    7  foo        7        1.0
    8  foo        8        1.0
    

    And an example with several POD values:

    >>> df2['EAI_Delta'] = (df2['EAI_ALL'] - df2.groupby('POD')['EAI_ALL'].shift()).fillna(0)
    >>> df2
       POD  EAI_ALL  EAI_Delta
    0  foo        0        0.0
    1  foo        1        1.0
    2  foo        2        1.0
    3  foo        3        1.0
    4  bar        4        0.0
    5  bar        5        1.0
    6  bar        6        1.0
    7  bar        7        1.0
    8  bar        8        1.0
    

    Even better as suggested by @mandiatodos in comments, you can directly use .groupby().diff()