Fill null values in a column with the nearest value in a pandas DataFrame


Say we have a pandas DataFrame like the one below:

ID    Name    DateTime               Days
------------------------------------------
1    AAA     2022-06-22 10:00:05     12
2    BBB     2022-06-22 10:02:00     
3    CCC     2022-06-22 10:04:00     16
4    DDD     2022-06-22 10:05:00     
5    EEE     2022-06-22 10:05:59     18

I need to fill the missing values in the Days column with the nearest value, where "nearest" is measured by DateTime. For example, for row #2 the nearest value is in row #1, because the time difference to row #1 is smaller than the difference to row #3. Additional rules:

  1. For the first row, use the next available value.
  2. For the last row, use the previous available value.
  3. If both time differences are equal, take the previous value.

The result should look like the table below:

ID    Name    DateTime               Days
------------------------------------------
1    AAA     2022-06-22 10:00:05     12
2    BBB     2022-06-22 10:02:00     12
3    CCC     2022-06-22 10:04:00     16
4    DDD     2022-06-22 10:05:00     18
5    EEE     2022-06-22 10:05:59     18
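
To make the "nearest" rule concrete, the gaps for rows #2 and #4 can be checked directly. This is only a small sketch using the timestamps from the table above; the variable names are made up for illustration.

```python
import pandas as pd

# Timestamps copied from the table above
t1 = pd.Timestamp("2022-06-22 10:00:05")
t2 = pd.Timestamp("2022-06-22 10:02:00")
t3 = pd.Timestamp("2022-06-22 10:04:00")
t4 = pd.Timestamp("2022-06-22 10:05:00")
t5 = pd.Timestamp("2022-06-22 10:05:59")

# Row #2: 1 min 55 s back to row #1 versus 2 min forward to row #3
row2_gap_prev = t2 - t1
row2_gap_next = t3 - t2

# Row #4: 1 min back to row #3 versus 59 s forward to row #5
row4_gap_prev = t4 - t3
row4_gap_next = t5 - t4

# Rule 3: on a tie the previous value wins, hence "<=" rather than "<"
row2_takes_prev = row2_gap_prev <= row2_gap_next
row4_takes_prev = row4_gap_prev <= row4_gap_next
```

Row #2 therefore takes 12 from row #1, and row #4 takes 18 from row #5.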

I tried doing this in a for loop, but it takes too long because I have to process 30k records. Is there another way of doing it?

Edit:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame([{"Id": 1, "Name": "AAA", "DateTime":pd.Timestamp('2022-06-22 10:00:05'), "Days": 12},
                  {"Id": 2, "Name": "BBB", "DateTime":pd.Timestamp('2022-06-22 10:02:00'), "Days": None},
                  {"Id": 3, "Name": "CCC", "DateTime":pd.Timestamp('2022-06-22 10:04:00'), "Days": 16},
                  {"Id": 4, "Name": "DDD", "DateTime":pd.Timestamp('2022-06-22 10:05:00'), "Days": None},
                  {"Id": 5, "Name": "EEE", "DateTime":pd.Timestamp('2022-06-22 10:05:59'), "Days": 18}])

df['TimeUpShift'] = df['DateTime'].shift(1)
df['TimeDownShift'] = df['DateTime'].shift(-1)

I was not able to proceed after this.

The for-loop logic that does the same thing:

# Assumes the default 0..n-1 index and at least one non-null value in Days.
last = len(df) - 1

# Edge rows: the first takes the next known value, the last the previous one.
if df['Days'][0] == '' or np.isnan(df['Days'][0]):
    index = df['Days'].first_valid_index()
    df.loc[0, 'Days'] = df.loc[index, 'Days'] if index is not None else None
if df['Days'][last] == '' or np.isnan(df['Days'][last]):
    index = df['Days'].last_valid_index()
    df.loc[last, 'Days'] = df.loc[index, 'Days'] if index is not None else None

for i in range(1, last):
    if df['Days'][i] == '' or np.isnan(df['Days'][i]):
        # backward search for the closest earlier row that has a value
        for pi in range(i - 1, -1, -1):
            if not (df['Days'][pi] == '' or np.isnan(df['Days'][pi])):
                prevrow = df.loc[pi]
                break
        # forward search for the closest later row that has a value
        for ni in range(i + 1, len(df)):
            if not (df['Days'][ni] == '' or np.isnan(df['Days'][ni])):
                nextrow = df.loc[ni]
                break
        # on a tie the earlier value wins
        if df['DateTime'][i] - prevrow['DateTime'] <= nextrow['DateTime'] - df['DateTime'][i]:
            df.loc[i, 'Days'] = prevrow['Days']
        else:
            df.loc[i, 'Days'] = nextrow['Days']
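
As a rough sketch, the same rules can also be expressed with column-wide operations instead of a Python loop, by forward- and backward-filling both the known timestamps and the known values. It assumes the rows are sorted by DateTime and that missing Days are NaN; names such as `known_time` and `use_prev` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "DateTime": pd.to_datetime([
        "2022-06-22 10:00:05", "2022-06-22 10:02:00", "2022-06-22 10:04:00",
        "2022-06-22 10:05:00", "2022-06-22 10:05:59",
    ]),
    "Days": [12, np.nan, 16, np.nan, 18],
})

# Keep timestamps only where Days is known, then spread them to neighbouring rows
known_time = df["DateTime"].where(df["Days"].notna())
prev_time = known_time.ffill()   # time of the closest earlier known value
next_time = known_time.bfill()   # time of the closest later known value

gap_prev = df["DateTime"] - prev_time
gap_next = next_time - df["DateTime"]

# Take the previous value on ties ("<=") and when no later value exists.
# Comparisons against NaT are False, so a missing previous value falls through to "next".
use_prev = next_time.isna() | (gap_prev <= gap_next)

filled = df["Days"].ffill().where(use_prev, df["Days"].bfill())
```

Assigning `df["Days"] = filled` would then write the result back to the frame.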

Solution

  • With merge_asof you can merge (match) each row to the row with the nearest DateTime that has a value. Only DateTime and Days are taken from the right-hand frame, so each row keeps its own Id and Name:

    pd.merge_asof(df.drop(columns='Days'), df.dropna(subset=['Days'])[['DateTime', 'Days']],
                  on='DateTime', direction='nearest')
    
       Id Name            DateTime  Days
    0   1  AAA 2022-06-22 10:00:05  12.0
    1   2  BBB 2022-06-22 10:02:00  12.0
    2   3  CCC 2022-06-22 10:04:00  16.0
    3   4  DDD 2022-06-22 10:05:00  18.0
    4   5  EEE 2022-06-22 10:05:59  18.0
    

    asof merge within a given tolerance

    You can limit how far away a match may be with the tolerance= parameter, which takes a pd.Timedelta:

    pd.merge_asof(df,df.dropna(), on='DateTime', direction='nearest', tolerance=pd.Timedelta('1m'), suffixes=('', '_y')) \
        .drop(columns=['Days','Id_y', 'Name_y']).rename(columns={'Days_y':'Days'})
    

    Result:

       Id Name            DateTime  Days
    0   1  AAA 2022-06-22 10:00:05  12.0
    1   2  BBB 2022-06-22 10:02:00   NaN
    2   3  CCC 2022-06-22 10:04:00  16.0
    3   4  DDD 2022-06-22 10:05:00  18.0
    4   5  EEE 2022-06-22 10:05:59  18.0
    

    The example above matches each row to the nearest time only if that time is within 1 minute. Row 2 stays NaN because its nearest value is 1 minute 55 seconds away. For more details on Timedelta see: https://pandas.pydata.org/docs/user_guide/timedeltas.html
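
For reuse, both variants could be wrapped in one small function. This is a minimal sketch, not part of pandas: `fill_nearest` and its parameter names are hypothetical, and it assumes the missing values are NaN and that the time column itself has no nulls.

```python
import numpy as np
import pandas as pd


def fill_nearest(frame, value_col="Days", time_col="DateTime", tolerance=None):
    """Hypothetical helper: fill value_col from the row nearest in time_col."""
    # merge_asof needs both sides sorted by the key column
    left = frame.drop(columns=value_col).sort_values(time_col)
    # The right side carries only the key and the known values,
    # so the other columns of each row are left untouched
    right = frame.loc[frame[value_col].notna(), [time_col, value_col]].sort_values(time_col)
    return pd.merge_asof(left, right, on=time_col,
                         direction="nearest", tolerance=tolerance)


df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "DateTime": pd.to_datetime([
        "2022-06-22 10:00:05", "2022-06-22 10:02:00", "2022-06-22 10:04:00",
        "2022-06-22 10:05:00", "2022-06-22 10:05:59",
    ]),
    "Days": [12, np.nan, 16, np.nan, 18],
})

filled_all = fill_nearest(df)                                     # no distance limit
filled_1min = fill_nearest(df, tolerance=pd.Timedelta(minutes=1)) # at most 1 minute away
```

With the sample data, `filled_all` fills every row, while `filled_1min` leaves row 2 (BBB) as NaN because no known value lies within one minute of it.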