Search code examples
pythonpandasnumpypython-datetimepandas-groupby

Calculate difference of 2 dates in a pandas groupby object of the same 2 dates


I'm trying to create a new pandas.DataFrame column of the number of business days between two date columns. I'm unable to reference the dates in the date columns as arguments in a function call (I get a TypeError: Cannot convert input error). However, I'm able to zip the values in the series into a List and use a For Loop to reference the parameters. Ideally, I would prefer to create a GroupBy object from the two Date columns and calculate the difference.

Create DataFrame:

import pandas as pd

df = pd.DataFrame.from_dict({'Date1': ['2017-05-30 16:00:00',
  '2017-05-30 16:00:00',
  '2017-05-30 16:00:00'],
 'Date2': ['2017-06-16 16:00:00',
  '2017-07-21 16:00:00',
  '2017-08-18 16:00:00'],
 'Value1': [2.97, 3.3, 4.03],
 'Value2': [96L, 14L, 2L]})

df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

df.dtypes

Validate DataFrame:

Date1     datetime64[ns]
Date2     datetime64[ns]
Value1           float64
Value2             int64
dtype: object

Define function:

def date_diff(startDate, endDate):
    return float(len(pd.bdate_range(startDate, endDate)) - 1)

Attempt to column from the result of the date_diff function call:

df['DateDiff'] = date_diff(df['Date1'], df['Date2'])

TypeError:

TypeError: Cannot convert input [0   2017-05-30 16:00:00
1   2017-05-30 16:00:00
2   2017-05-30 16:00:00
Name: Date1, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp

A "For Loop" referencing a list of tuples containing the dates works:

date_List = list(zip(df['Date1'], df['Date2']))

for i in range(len(date_List)):
    df.loc[(df['Date1'] == date_List[i][0]) & (df['Date2'] == date_List[i][1]), 'diff'] = date_diff(date_List[i][0], date_List[i][1])

                Date1               Date2  Value1  Value2  diff
0 2017-05-30 16:00:00 2017-06-16 16:00:00    2.97      96  13.0
1 2017-05-30 16:00:00 2017-07-21 16:00:00    3.30      14  38.0
2 2017-05-30 16:00:00 2017-08-18 16:00:00    4.03       2  58.0

Ideally, I'd like to utilize a GroupBy object (by Date1 & Date2):

grp = df.groupby(['Date1', 'Date2'])

Desired Output:

[((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-06-16 16:00:00')),
                  Date1               Date2  Value1  Value2  diff
  0 2017-05-30 16:00:00 2017-06-16 16:00:00    2.97      96  13.0),
 ((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-07-21 16:00:00')),
                  Date1               Date2  Value1  Value2  diff
  1 2017-05-30 16:00:00 2017-07-21 16:00:00     3.3      14  38.0),
 ((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-08-18 16:00:00')),
                  Date1               Date2  Value1  Value2  diff
  2 2017-05-30 16:00:00 2017-08-18 16:00:00    4.03       2  58.0)]

Solution

  • You need a type cast to datetime64[D] to make numpy happy like:

    Code:

    import numpy as np
    
    def date_diff(start_dates, end_dates):
        return np.busday_count(
            start_dates.values.astype('datetime64[D]'),
            end_dates.values.astype('datetime64[D]'))
    

    Test Code:

    import pandas as pd
    df = pd.DataFrame.from_dict({'Date1': ['2017-05-30 16:00:00',
                                           '2017-05-30 16:00:00',
                                           '2017-05-30 16:00:00'],
                                 'Date2': ['2017-06-16 16:00:00',
                                           '2017-07-21 16:00:00',
                                           '2017-08-18 16:00:00'],
                                 'Value1': [2.97, 3.3, 4.03],
                                 'Value2': [96L, 14L, 2L]})
    
    df['Date1'] = pd.to_datetime(df['Date1'])
    df['Date2'] = pd.to_datetime(df['Date2'])
    
    df['DateDiff'] = date_diff(df['Date1'], df['Date2'])
    print(df)
    

    Results:

                    Date1               Date2  Value1  Value2  DateDiff
    0 2017-05-30 16:00:00 2017-06-16 16:00:00    2.97      96        13
    1 2017-05-30 16:00:00 2017-07-21 16:00:00    3.30      14        38
    2 2017-05-30 16:00:00 2017-08-18 16:00:00    4.03       2        58