I need to calculate the distance to the nearest christmas by date, not just naively by number of rows. My toy data has the columns: road_id, traffic, date, and is_christmas.
I can do something like the following:
import pandas as pd
import numpy as np
# create example dataframe
df = pd.DataFrame({
'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})
# find the nearest christmas day for each row
df['nearest_christmas'] = np.nan
for i, row in df.iterrows():
if row['is_christmas'] == 1:
df.at[i, 'nearest_christmas'] = 0
else:
nearest_christmas_index = (df.loc[df['is_christmas'] == 1, 'date'] - row['date']).abs().idxmin()
df.at[i, 'nearest_christmas'] = (df.at[nearest_christmas_index, 'date'] - row['date']).days
print(df)
but this doesn't take into account the different road_ids (which would matter if I did it rows-wise), and also doesn't seem right and seems over-engineered. It is also slow as hell on my real data, which is quite large.
My next attempt:
import pandas as pd
# create example dataframe
df = pd.DataFrame({
'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})
df['nearest_christmas'] = df.groupby('is_christmas')['date'].transform(lambda x: x.diff().abs().dt.days)
print(df)
Does it by date... but gives me the wrong values A and B I can't figure out how to get to work with road_id. It may not actually even need to use road_id, but there ARE multiple same-date entries with the is_christmas flag set to 1.
You can use a self-merge_asof
(between the original dataframe and only the is_christmas
dates using boolean indexing):
m = df['is_christmas'].eq(1)
out = (pd.merge_asof(df.reset_index().sort_values(by='date'),
df.loc[m, ['date']].sort_values(by='date')
.rename(columns={'date': 'nearest_christmas'}),
left_on='date', right_on='nearest_christmas',
direction='nearest')
.set_index('index').loc[df.index]
)
NB. if traffic
should be the same for the two dates, add by='road_id
in the merge_asof
, make sure to slice the column in the right DataFrame using df.loc[m, ['date', 'road_id']]
.
Output:
traffic date is_christmas nearest_christmas
0 100 2021-12-24 0 2021-12-25
1 200 2021-12-25 1 2021-12-25
2 150 2021-12-26 0 2021-12-25
3 300 2021-12-27 0 2021-12-25
4 250 2021-12-28 0 2021-12-29
5 400 2021-12-29 1 2021-12-29
6 350 2021-12-30 0 2021-12-29
7 500 2021-12-31 0 2021-12-29
If you want the difference of days, subtract the original date and get the number of days with dt.days
:
out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days
# or for the absolute difference
out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days.abs()
Output:
traffic date is_christmas nearest_christmas
0 100 2021-12-24 0 1
1 200 2021-12-25 1 0
2 150 2021-12-26 0 -1
3 300 2021-12-27 0 -2
4 250 2021-12-28 0 1
5 400 2021-12-29 1 0
6 350 2021-12-30 0 -1
7 500 2021-12-31 0 -2