I have a dataset named df with the following structure:
date | word | total | top |
---|---|---|---|
14/10/2023 | python | 52 | 5 |
15/10/2023 | python | 54 | 9 |
16/10/2023 | R | 52 | 2 |
17/10/2023 | R | 12 | 1 |
18/10/2023 | R | 45 | 1 |
I need to manipulate the data in a way that I can create two new columns:
The column "delta_top" should display the difference in the "top" column from the current date to the previous date for that word.
The column "delta_total" should show the percentage increase or decrease for that word on the current date compared to the previous date. Note that if there is no previous date, we should assign "NA."
For instance, for the word "R," the oldest date of reference is 16/10/2023, so we can't calculate its "delta_top" or "delta_total." Therefore, we assign "NA." But on date 17/10/2023, the word "R" went from top 2 to top 1, so we subtract the previous value from the current value, resulting in 1 (it has gone up 1 spot). However, "delta_total" will show -0.76, indicating that the subtotal went down by 76%.
My desired output would look like this:
date | word | total | top | delta_top | delta_total |
---|---|---|---|---|---|
14/10/2023 | python | 52 | 5 | NA | NA |
15/10/2023 | python | 54 | 9 | 4 | 0.037037037 |
16/10/2023 | R | 52 | 2 | NA | NA |
17/10/2023 | R | 12 | 1 | 1 | -0.769230769 |
18/10/2023 | R | 45 | 1 | 0 | 2.75 |
I have been trying to create this table using the chaining method in pandas, but I always get an error. The truth is that my actual dataset has over 3 million records, so I need to come up with a fast and convenient solution. I'm quite new to Python.
Work around:
df = (
df.assign(date=pd.to_datetime(df['date'], format='%d/%m/%Y'))
.sort_values(by=['word', 'date'])
.assign(delta_top=lambda x: x['top'] - x.groupby('word')['top'].shift(1),
delta_total=lambda x: ((x['total'] - x.groupby('word')['total'].shift(1)) / x.groupby('word')['total'].shift(1)).fillna('NA'))
I feel like I'm following a proper logic way but this code is taking forever to load.
try this:
grouped = df.groupby('word', as_index=False, group_keys=False)
df['delta_top'] = grouped['top'].apply(lambda x: x.diff().abs())
df['delta_total'] = grouped['total'].apply(lambda x: x.pct_change())
print(df)
>>>
date word total top delta_top delta_total
0 14/10/2023 python 52 5 NaN NaN
1 15/10/2023 python 54 9 4.0 0.038462
2 16/10/2023 R 52 2 NaN NaN
3 17/10/2023 R 12 1 1.0 -0.769231
4 18/10/2023 R 45 1 0.0 2.750000