Search code examples
pythonpython-3.xpandas

Groupby and shift without using lambda function


I have a dataframe like so:

data = {
    'ID': [1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1],
    'timestamp': pd.date_range(start='1/1/2023', periods=14, freq='D'),
    'value': [11, 22, 33, 44, 55, 66, 77, 88, 99, 11, 22, 33, 44, 55]
}

My actual dataframe contains millions of rows. I sort the timestamp columns and so the ID column gets intersperesed when you look at the raw dataframe.

I want to groupby ID and find the difference between each row and the 3rd previous row. I currently have it working like so:

# Sort by ID and timestamp
df = df.sort_values(by=['ID', 'timestamp'])

# Group by 'ID' and calculate the difference with the 5th previous row
df['value_diff'] = df.groupby('ID', group_keys=False)['value'].apply(lambda x: x - x.shift(3))

However since my actual dataframe is huge, it takes quite a bit of time. I also read that using lambda is slow. Eventually I want to filter based on whether the value_diff column is increasing or decresing. Typically I would use

inc_check=df['value'].diff(3).ge(0)
df=df[inc_check]

but it doesnt groupby to calculate the difference.

Is there a more elgant way to achieve this?


Solution

  • Don't use apply, just go with groupby.shift directly and then subtract to the original column:

    df['value_diff'] = df['value'] - df.groupby('ID')['value'].shift(3)
    

    Or groupby.diff:

    df['value_diff'] = df.groupby('ID')['value'].diff(3)
    

    Output:

        ID  timestamp  value  value_diff
    0    1 2023-01-01     11         NaN
    1    1 2023-01-02     22         NaN
    3    1 2023-01-04     44         NaN
    5    1 2023-01-06     66        55.0
    6    1 2023-01-07     77        55.0
    10   1 2023-01-11     22       -22.0
    13   1 2023-01-14     55       -11.0
    2    2 2023-01-03     33         NaN
    4    2 2023-01-05     55         NaN
    7    2 2023-01-08     88         NaN
    8    2 2023-01-09     99        66.0
    9    2 2023-01-10     11       -44.0
    11   2 2023-01-12     33       -55.0
    12   2 2023-01-13     44       -55.0