Search code examples
pythonpandastimeattribution

Return duration for each id


I have a large list of events being tracked with a timestamp appended to each:

I currently have the following table:

ID  Time_Stamp         Event
1   2/20/2019 18:21    0
1   2/20/2019 19:46    0
1   2/21/2019 18:35    0
1   2/22/2019 11:39    1
1   2/22/2019 16:46    0
1   2/23/2019 7:40     0
2   6/5/2019 0:10      0
3   7/31/2019 10:18    0
3   8/23/2019 16:33    0
4   6/26/2019 20:49    0

What I want is the following [but not sure if it's possible]:

ID  Time_Stamp       Conversion  Total_Duration_Days    Conversion_Duration
1   2/20/2019 18:21  0           2.555                  1.721
1   2/20/2019 19:46  0           2.555                  1.721
1   2/21/2019 18:35  0           2.555                  1.721
1   2/22/2019 11:39  1           2.555                  1.721
1   2/22/2019 16:46  1           2.555                  1.934
1   2/23/2019 7:40   0           2.555                  1.934
2   6/5/2019 0:10    0           1.00                   0.000
3   7/31/2019 10:18  0           23.260                 0.000
3   8/23/2019 16:33  0           23.260                 0.000
4   6/26/2019 20:49  0           1.00                   0.000

For #1 Total Duration = Max Date - Min Date [2.555 Days]

For #2 Conversion Duration = Conversion Date - Min Date [1.721 Days] - following actions post the conversion can remain at the calculated duration

I have attempted the following:

df.reset_index(inplace=True)
df.groupby(['ID'])['Time_Stamp].diff().fillna(0)

This kind of does what I want, but it's showing the difference between each event, not the min time stamp to the max time stamp

conv_test = df.reset_index(inplace=True)

min_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('min').to_frame('MinTime')

max_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('max').to_frame('MaxTime')

conv_test = conv_test.set_index('ID').merge(min_df, left_index=True, right_index=True)

conv_test = conv_test.merge(max_df, left_index=True, right_index=True)

conv_test['Durartion'] = conv_test['MaxTime'] - conv_test['MinTime']

This gives me Total_Duration_Days which is great [feel free to offer a more elegant solution

Any ideas on how I can get Conversion_Duration?


Solution

  • You can use GroupBy.transform with min and max for Series with same size like original, so possible subtract for Total_Duration_Days and then filter only 1 rows by Event, create Series by DataFrame.set_index and convert to dict, then Series.map for new Series, so possible subtract minimal values per groups:

    df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
    
    min1 = df.groupby('ID')['Time_Stamp'].transform('min')
    max1 = df.groupby('ID')['Time_Stamp'].transform('max')
    df['Total_Duration_Days'] = max1.sub(min1).dt.total_seconds() / (3600 * 24)
    
    d = df.loc[df['Event'] == 1].set_index('ID')['Time_Stamp'].to_dict()
    new1 = df['ID'].map(d)
    

    Because possible multiple 1 per groups is added solution only for this groups - testing, if more 1 per groups in mask, get Series new2 and then use Series.combine_first with mapped Series new1.

    Reason is improve performance, because a bit complicated processing multiple 1.

    mask = df['Event'].eq(1).groupby(df['ID']).transform('sum').gt(1)
    g = df[mask].groupby('ID')['Event'].cumsum().replace({0:np.nan})
    new2 = (df[mask].groupby(['ID', g])['Time_Stamp']
             .transform('first')
             .groupby(df['ID'])
             .bfill())
    df['Conversion_Duration'] = (new2.combine_first(new1)
                                    .sub(min1)
                                    .dt.total_seconds().fillna(0) / (3600 * 24))
    
    print (df)
       ID          Time_Stamp  Event  Total_Duration_Days  Conversion_Duration
    0   1 2019-02-20 18:21:00      0             2.554861             1.720833
    1   1 2019-02-20 19:46:00      0             2.554861             1.720833
    2   1 2019-02-21 18:35:00      0             2.554861             1.720833
    3   1 2019-02-22 11:39:00      1             2.554861             1.720833
    4   1 2019-02-22 16:46:00      1             2.554861             1.934028
    5   1 2019-02-23 07:40:00      0             2.554861             1.934028
    6   2 2019-06-05 00:10:00      0             0.000000             0.000000
    7   3 2019-07-31 10:18:00      0            23.260417             0.000000
    8   3 2019-08-23 16:33:00      0            23.260417             0.000000
    9   4 2019-06-26 20:49:00      0             0.000000             0.000000