Search code examples
pythonpandasdatetimegroup-by

How to calculate distance to nearest flagged row by date difference in pandas?


I need to calculate the distance to the nearest christmas by date, not just naively by number of rows. My toy data has the columns: road_id, traffic, date, and is_christmas.

I can do something like the following:

import pandas as pd
import numpy as np

# create example dataframe
df = pd.DataFrame({
    'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
    'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
    'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})

# find the nearest christmas day for each row
df['nearest_christmas'] = np.nan
for i, row in df.iterrows():
    if row['is_christmas'] == 1:
        df.at[i, 'nearest_christmas'] = 0
    else:
        nearest_christmas_index = (df.loc[df['is_christmas'] == 1, 'date'] - row['date']).abs().idxmin()
        df.at[i, 'nearest_christmas'] = (df.at[nearest_christmas_index, 'date'] - row['date']).days

print(df)

but this doesn't take into account the different road_ids (which would matter if I did it rows-wise), and also doesn't seem right and seems over-engineered. It is also slow as hell on my real data, which is quite large.

My next attempt:

import pandas as pd

# create example dataframe
df = pd.DataFrame({
    'traffic': [100, 200, 150, 300, 250, 400, 350, 500],
    'date': pd.date_range(start='2021-12-24', periods=8, freq='D'),
    'is_christmas': [0, 1, 0, 0, 0, 1, 0, 0]
})

df['nearest_christmas'] = df.groupby('is_christmas')['date'].transform(lambda x: x.diff().abs().dt.days)

print(df)

Does it by date... but gives me the wrong values A and B I can't figure out how to get to work with road_id. It may not actually even need to use road_id, but there ARE multiple same-date entries with the is_christmas flag set to 1.


Solution

  • You can use a self-merge_asof (between the original dataframe and only the is_christmas dates using boolean indexing):

    m = df['is_christmas'].eq(1)
    
    out = (pd.merge_asof(df.reset_index().sort_values(by='date'),
                         df.loc[m, ['date']].sort_values(by='date')
                           .rename(columns={'date': 'nearest_christmas'}),
                         left_on='date', right_on='nearest_christmas',
                         direction='nearest')
             .set_index('index').loc[df.index]
          )
    

    NB. if traffic should be the same for the two dates, add by='road_id in the merge_asof, make sure to slice the column in the right DataFrame using df.loc[m, ['date', 'road_id']].

    Output:

       traffic       date  is_christmas nearest_christmas
    0      100 2021-12-24             0        2021-12-25
    1      200 2021-12-25             1        2021-12-25
    2      150 2021-12-26             0        2021-12-25
    3      300 2021-12-27             0        2021-12-25
    4      250 2021-12-28             0        2021-12-29
    5      400 2021-12-29             1        2021-12-29
    6      350 2021-12-30             0        2021-12-29
    7      500 2021-12-31             0        2021-12-29
    

    If you want the difference of days, subtract the original date and get the number of days with dt.days:

    out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days
    
    # or for the absolute difference
    out['nearest_christmas'] = out['nearest_christmas'].sub(out['date']).dt.days.abs()
    

    Output:

       traffic       date  is_christmas  nearest_christmas
    0      100 2021-12-24             0                  1
    1      200 2021-12-25             1                  0
    2      150 2021-12-26             0                 -1
    3      300 2021-12-27             0                 -2
    4      250 2021-12-28             0                  1
    5      400 2021-12-29             1                  0
    6      350 2021-12-30             0                 -1
    7      500 2021-12-31             0                 -2