Search code examples
pythonpandasfor-loopdatetimedatediff

How to count date difference in python


I have a data frame named table which looks like below

       Date         ID
0   2015-12-30  2658789.0
1   2017-04-12  2658789.0
2   2014-06-23  2658789.0
3   2015-12-30  2658789.0
4   2016-01-06  2658789.0
5   2012-10-15  2658731.0
6   2011-04-14  2660120.0
7   2015-09-28  2660120.0
8   2017-02-10  2660120.0
9   2016-12-12  2660649.0
10  2017-05-03  2660649.0
11  2016-05-25  2660649.0
12  2015-09-28  2660649.0
13  2014-08-11  2660649.0
14  2016-01-22  2660649.0

print(table)
{'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}

The types of 2 columns in table

table.dtypes 
Date     object
ID      float64
dtype: object

I need to count the average of follow-up times for each ID.
For each ID, follow-up times are defined as the time period (in days) between
the latest date and the first date
the second-latest date and the first date
the third-latest date and the first date
etc.

I wrote script for more simple situation (please see below), in which the follow up time is just the time period between the earliest and the latest date for each ID. The problem with my script is that as it iterated over rows, it took much time to run

# Group and get distinct Date by ID 
table = table['Date'].groupby(table['ID']).unique().reset_index()

# Get the latest and earliest Dates of each ID
from datetime import datetime
result = []
for index in range(len(table)):
    list_of_dates= [datetime.strptime(date,"%Y-%m-%d") for date in table.iloc[index,1]]
    n = table.iloc[index,0]
    x = max(list_of_dates)
    y = min(list_of_dates)
    result.append([n, x, y])
    df = pd.DataFrame(result, columns=['ID', 'LatestDate', 'EarliestDate'])
    print(df)

# Count Follow-up time
df.loc[:,'LatestDate':'EarliestDate'] = df.loc[:,'LatestDate':'EarliestDate'].astype('datetime64[ns]')

df['FollowUpPeriod'] = (df['LatestDate'] - df['EarliestDate']).dt.days
df['FollowUpPeriod'].mean()

Could anyone help me with the new situation and in a more efficient way?
Any help would be greatly appreciated!!


Solution

  • import pandas as pd
    
    data = {'Date': {0: '2015-12-30', 1: '2017-04-12', 2: '2014-06-23', 3: '2015-12-30', 4: '2016-01-06', 5: '2012-10-15', 6: '2011-04-14', 7: '2015-09-28', 8: '2017-02-10', 9: '2016-12-12', 10: '2017-05-03', 11: '2016-05-25', 12: '2015-09-28', 13: '2014-08-11', 14: '2016-01-22'}, 'ID': {0: 2658789.0, 1: 2658789.0, 2: 2658789.0, 3: 2658789.0, 4: 2658789.0, 5: 2658731.0, 6: 2660120.0, 7: 2660120.0, 8: 2660120.0, 9: 2660649.0, 10: 2660649.0, 11: 2660649.0, 12: 2660649.0, 13: 2660649.0, 14: 2660649.0}}
    df = pd.DataFrame.from_dict(data)
    df.Date = pd.to_datetime(df.Date)
    
    (
        df
        .set_index('ID')
        .get('Date')
        .sort_values()
        .groupby(level=0)
        .apply(lambda s: (s.drop_duplicates().iloc[1:] - s.iloc[0]).mean())
        .fillna(pd.Timedelta(0))
    )
    

    Output:

    ID
    2658731.0      0 days 00:00:00
    2658789.0    713 days 16:00:00
    2660120.0   1878 days 12:00:00
    2660649.0    689 days 00:00:00
    Name: Date, dtype: timedelta64[ns]
    

    Let's check it manually for ID = 2658789.0 to be sure we are on the right way:

    df.query('ID==2658789')['Date'].drop_duplicates().sort_values()
    
    2   2014-06-23
    0   2015-12-30
    4   2016-01-06
    1   2017-04-12
    

    So we have to calculate the mean of three differences:

    (
      (2015-12-30 - 2014-06-23) 
    + (2016-01-06 - 2014-06-23)
    + (2017-04-12 - 2014-06-23)
    ) / 3 
    = (555 days + 562 days + 1024 days)/3 
    = 713.6(6) days    (0.66... days = 24 * 0.66... hours = 16 hours)