Search code examples
pythonpandasdatetimetimestampsubtraction

Subtract last timestamp from first timestamp for each Id in Pandas Dataframe


I have a dataframe (df) with the following structure:

retweet_datetime tweet_id tweet_datetime
2020-04-24 03:33:15 85053699 2020-04-24 02:28:22
2020-04-24 02:43:35 85053699 2020-04-24 02:28:22
2020-04-18 04:24:03 86095361 2020-04-18 00:06:01
2020-04-18 00:19:08 86095361 2020-04-18 00:06:01
2020-04-18 00:18:38 86095361 2020-04-18 00:06:01
2020-04-18 00:07:08 86095361 2020-04-18 00:06:01

The retweet_datetime is sorted from latest to newest retweets.

I'd like to create two new columns as follows:

  1. tweet_lifetime1: the difference between the last retweet time and the first retweet time, i.e., for each tweet_id: last retweet_datetime - first retweet_datetime
  2. tweet_lifetime2: the difference between the last retweet time and tweet creation time (tweet_datetime)

Update

For example, for the tweet id: "86095361":

  • tweet_lifetime1 = 2020-04-18 04:24:03 - 2020-04-18 00:07:08 (04:16:55)
  • tweet_lifetime2 = 2020-04-18 04:24:03 - 2020-04-18 00:06:01 (04:18:02)

The expected output df:

retweet_datetime tweet_id tweet_datetime lifetime1 lifetime2
2020-04-24 03:33:15 85053699 2020-04-24 02:28:22 00:49:40 01:04:53
2020-04-18 04:24:03 86095361 2020-04-18 00:06:01 04:16:55 04:18:02

I've seen several similar posts, but they mostly subtract consecutive rows. For example, I can subtract the time difference between each retweet_datetimes for each tweet id as follows:

df2 = df.assign(delta = df.groupby('tweet_id')['retweet_datetime'].diff())

Solution

  • Use named aggregation with subtract column with Series.sub, DataFrame.pop is used for drop column tmp after processing:

    df1 = (df.groupby('tweet_id', as_index=False)
             .agg(retweet_datetime=('retweet_datetime','first'),
                  tmp = ('retweet_datetime','last'),
                  tweet_datetime = ('tweet_datetime','last')))
    
    df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp'))
    df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime'])
    print (df1)
       tweet_id    retweet_datetime      tweet_datetime       lifetime1  \
    0  85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 0 days 00:49:40   
    1  86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 0 days 04:16:55   
    
            lifetime2  
    0 0 days 01:04:53  
    1 0 days 04:18:02  
    

    If need format HH:MM:SS use:

    def f(x):
        ts = x.total_seconds()
        hours, remainder = divmod(ts, 3600)
        minutes, seconds = divmod(remainder, 60)
        return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
    
    df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp')).apply(f)
    df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime']).apply(f)
    print (df1)
       tweet_id    retweet_datetime      tweet_datetime lifetime1 lifetime2
    0  85053699 2020-04-24 03:33:15 2020-04-24 02:28:22  00:49:40  01:04:53
    1  86095361 2020-04-18 04:24:03 2020-04-18 00:06:01  04:16:55  04:18:02