How to calculate distance to nearest flagged row by date difference in pandas?

I need to calculate the distance to the nearest christmas by date, not just naively by number of rows. My toy data has the columns: road_id, traffic, date, and is_christmas.

I can do something like the following:

import pandas as pd
import numpy as np

# create example dataframe
df = pd.DataFrame({
    'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
    'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
    'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})

# find the nearest christmas day for each row
df['nearest_christmas'] = np.nan
for i, row in df.iterrows():
    if row['is_christmas'] == 1:
        df.at[i, 'nearest_christmas'] = 0
    else:
        nearest_christmas_index = (df.loc[df['is_christmas'] == 1, 'date'] - row['date']).abs().idxmin()
        df.at[i, 'nearest_christmas'] = (df.at[nearest_christmas_index, 'date'] - row['date']).days

print(df)

but this doesn't take into account the different road_ids (which would matter if I did it rows-wise), and also doesn't seem right and seems over-engineered. It is also slow as hell on my real data, which is quite large.

My next attempt:

import pandas as pd

# create example dataframe
df = pd.DataFrame({
    'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
    'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
    'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})

df['nearest_christmas'] = df.groupby('is_christmas')['date'].transform(lambda x: x.diff().abs().dt.days)

print(df)

Does it by date... but gives me the wrong values A and B I can't figure out how to get to work with road_id. It may not actually even need to use road_id, but there ARE multiple same-date entries with the is_christmas flag set to 1.

Solution

You can use a self-merge_asof (between the original dataframe and only the is_christmas dates using boolean indexing):

m = df['is_christmas'].eq(1)

out = (pd.merge_asof(df.reset_index().sort_values(by='date'),
                     df.loc[m, ['date']].sort_values(by='date')
                       .rename(columns={'date': 'nearest_christmas'}),
                     left_on='date', right_on='nearest_christmas',
                     direction='nearest')
         .set_index('index').loc[df.index]
      )

NB. if traffic should be the same for the two dates, add by='road_id in the merge_asof, make sure to slice the column in the right DataFrame using df.loc[m, ['date', 'road_id']].

Output:

   traffic       date  is_christmas nearest_christmas
0      100 2021-12-24             0        2021-12-25
1      200 2021-12-25             1        2021-12-25
2      150 2021-12-26             0        2021-12-25
3      300 2021-12-27             0        2021-12-25
4      250 2021-12-28             0        2021-12-29
5      400 2021-12-29             1        2021-12-29
6      350 2021-12-30             0        2021-12-29
7      500 2021-12-31             0        2021-12-29

If you want the difference of days, subtract the original date and get the number of days with dt.days:

out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days

# or for the absolute difference
out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days.abs()

Output:

   traffic       date  is_christmas  nearest_christmas
0      100 2021-12-24             0                  1
1      200 2021-12-25             1                  0
2      150 2021-12-26             0                 -1
3      300 2021-12-27             0                 -2
4      250 2021-12-28             0                  1
5      400 2021-12-29             1                  0
6      350 2021-12-30             0                 -1
7      500 2021-12-31             0                 -2