Search code examples
pythonpandasgroup-bydata-wrangling

How to calculate deltas and percentage changes in Pandas grouping on date by specific conditions


I have a dataset named df with the following structure:

date word total top
14/10/2023 python 52 5
15/10/2023 python 54 9
16/10/2023 R 52 2
17/10/2023 R 12 1
18/10/2023 R 45 1

I need to manipulate the data in a way that I can create two new columns:

The column "delta_top" should display the difference in the "top" column from the current date to the previous date for that word.

The column "delta_total" should show the percentage increase or decrease for that word on the current date compared to the previous date. Note that if there is no previous date, we should assign "NA."

For instance, for the word "R," the oldest date of reference is 16/10/2023, so we can't calculate its "delta_top" or "delta_total." Therefore, we assign "NA." But on date 17/10/2023, the word "R" went from top 2 to top 1, so we subtract the previous value from the current value, resulting in 1 (it has gone up 1 spot). However, "delta_total" will show -0.76, indicating that the subtotal went down by 76%.

My desired output would look like this:

date word total top delta_top delta_total
14/10/2023 python 52 5 NA NA
15/10/2023 python 54 9 4 0.037037037
16/10/2023 R 52 2 NA NA
17/10/2023 R 12 1 1 -0.769230769
18/10/2023 R 45 1 0 2.75

I have been trying to create this table using the chaining method in pandas, but I always get an error. The truth is that my actual dataset has over 3 million records, so I need to come up with a fast and convenient solution. I'm quite new to Python.

Work around:

df = (
    df.assign(date=pd.to_datetime(df['date'], format='%d/%m/%Y'))
      .sort_values(by=['word', 'date'])
      .assign(delta_top=lambda x: x['top'] - x.groupby('word')['top'].shift(1),
              delta_total=lambda x: ((x['total'] - x.groupby('word')['total'].shift(1)) / x.groupby('word')['total'].shift(1)).fillna('NA'))

I feel like I'm following a proper logic way but this code is taking forever to load.


Solution

  • try this:

    grouped = df.groupby('word', as_index=False, group_keys=False)
    df['delta_top'] = grouped['top'].apply(lambda x: x.diff().abs())
    df['delta_total'] = grouped['total'].apply(lambda x: x.pct_change())
    print(df)
    >>>
             date    word  total  top  delta_top  delta_total
    0  14/10/2023  python     52    5        NaN          NaN
    1  15/10/2023  python     54    9        4.0     0.038462
    2  16/10/2023       R     52    2        NaN          NaN
    3  17/10/2023       R     12    1        1.0    -0.769231
    4  18/10/2023       R     45    1        0.0     2.750000