Search code examples
datetimepandastimedelta

Compare dates with null values


I have two columns. I want to check if the difference between them is between 0 and 10 days. One of the fields often contains null values.

df['Diff'] = (df['Dt1'] - df['Dt2'])

def wdw(x):
    if pd.notnull(x):
        if type(x) !=long:
            if type(timedelta(days=10)) != long:
                if x > timedelta(days=10):
                    return 1
    else:
        return 0

df['Diff'].df(wdw)

When I run this I get the following error.

TypeError: can't compare datetime.timedelta to long

When I look at the values of df['Diff'] They all appear to be timedeltas. Any idea what is going on here? It seems like creating an indicator based on the difference between two date fields should be easier than this...


Solution

  • The values in df['Diff'] are numpy timedelta64s. You can compare them with pd.Timedeltas; see below.

    Moreover, you do not need to call df['Diff'].apply(wdw), which calls wdw for each value in the Series; you can compare whole Series with a pd.Timedelta:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'Dt1':pd.date_range('2010-1-1', freq='5D', periods=10),
                       'Dt2':pd.date_range('2010-1-2', freq='3D', periods=10)})
    df.iloc[::3, 1] = np.nan
    
    df['Diff'] = df['Dt1'] - df['Dt2']
    print(df)
    
    #          Dt1        Dt2    Diff
    # 0 2010-01-01        NaT     NaT
    # 1 2010-01-06 2010-01-05  1 days
    # 2 2010-01-11 2010-01-08  3 days
    # 3 2010-01-16        NaT     NaT
    # 4 2010-01-21 2010-01-14  7 days
    # 5 2010-01-26 2010-01-17  9 days
    # 6 2010-01-31        NaT     NaT
    # 7 2010-02-05 2010-01-23 13 days
    # 8 2010-02-10 2010-01-26 15 days
    # 9 2010-02-15        NaT     NaT
    
    mask = (df['Diff'] < pd.Timedelta(days=10)) & (pd.Timedelta(days=0) < df['Diff'])
    print(mask)
    

    yields

    0    False
    1     True
    2     True
    3    False
    4     True
    5     True
    6    False
    7    False
    8    False
    9    False
    Name: Diff, dtype: bool
    

    pd.Timedeltas were introduced in Pandas v.0.15. Here is a workaround for older version of Pandas which uses np.timedela64s:

    mask = ((df['Diff'] / np.timedelta64(10, 'D') < 10)
            & (df['Diff'] / np.timedelta64(10, 'D') > 0))