How to generate pandas dataframe timedelta column grouped by id and date (YYYY-MM-DD)?


Suppose I have a dataframe with id and datetime columns:

import pandas as pd

df = pd.DataFrame({"id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a3", "a3", "a3", "a3"],
                   "datetime": ["2016-01-01 00:01:00.156",
                                "2016-01-01 12:00:00.425",
                                "2016-01-02 00:59:00.123",
                                "2016-01-02 14:16:00.548",
                                "2016-01-01 12:00:00.147",
                                "2016-01-01 13:59:00.123",
                                "2016-01-02 08:01:00.147",
                                "2016-01-02 18:49:00.123",
                                "2016-02-01 12:00:00.147",
                                "2016-02-01 13:59:00.123",
                                "2016-02-02 08:01:00.147",
                                "2016-02-02 18:49:00.123"]})
df["datetime"] = pd.to_datetime(df["datetime"])
df

Here is the dataframe:

    id  datetime
0   a1  2016-01-01 00:01:00.156
1   a1  2016-01-01 12:00:00.425
2   a1  2016-01-02 00:59:00.123
3   a1  2016-01-02 14:16:00.548
4   a2  2016-01-01 12:00:00.147
5   a2  2016-01-01 13:59:00.123
6   a2  2016-01-02 08:01:00.147
7   a2  2016-01-02 18:49:00.123
8   a3  2016-02-01 12:00:00.147
9   a3  2016-02-01 13:59:00.123
10  a3  2016-02-02 08:01:00.147
11  a3  2016-02-02 18:49:00.123

I want to add a timedelta column that holds the number of minutes elapsed since a per-group baseline datetime. This is the output I expect to get:

    id  datetime                datetime_baseline       timedelta
0   a1  2016-01-01 00:01:00.156 2016-01-01 00:01:00.156 0
1   a1  2016-01-01 12:00:00.425 2016-01-01 00:01:00.156 719
2   a1  2016-01-02 00:59:00.123 2016-01-02 00:59:00.123 0
3   a1  2016-01-02 14:16:00.548 2016-01-02 00:59:00.123 797
4   a2  2016-01-01 12:00:00.147 2016-01-01 12:00:00.147 0
5   a2  2016-01-01 13:59:00.123 2016-01-01 12:00:00.147 119
6   a2  2016-01-02 08:01:00.147 2016-01-02 08:01:00.147 0
7   a2  2016-01-02 18:49:00.123 2016-01-02 08:01:00.147 648
8   a3  2016-02-01 12:00:00.147 2016-02-01 12:00:00.147 0
9   a3  2016-02-01 13:59:00.123 2016-02-01 12:00:00.147 119
10  a3  2016-02-02 08:01:00.147 2016-02-02 08:01:00.147 0
11  a3  2016-02-02 18:49:00.123 2016-02-02 08:01:00.147 648

Here is how the timedelta values should be calculated: 1) the code needs to identify the FIRST datetime within the same id and date ('YYYY-MM-DD'), and 2) use it as the baseline (datetime_baseline) to compute the timedelta (in minutes) of every other datetime within the same id and the same date.

For id='a1' and date='2016-01-01', datetime_baseline='2016-01-01 00:01:00.156'. So, at index=0, timedelta=0 because '2016-01-01 00:01:00.156' - datetime_baseline = 0. At index=1, timedelta=719 because '2016-01-01 12:00:00.425' - datetime_baseline = 719 (minutes, rounded).

At index=2, the id is the same as before but the date is now '2016-01-02', so a new baseline is used: '2016-01-02 00:59:00.123'. Here timedelta = '2016-01-02 00:59:00.123' - datetime_baseline = 0. At index=3, timedelta = '2016-01-02 14:16:00.548' - datetime_baseline = 797.
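To make the arithmetic concrete, the index=1 case can be checked by hand with two timestamps. This is only a sketch of the subtraction for one row, not a solution for the whole dataframe; the variable names are illustrative.

```python
import pandas as pd

# Baseline and later timestamp for id='a1' on 2016-01-01 (taken from the table above)
baseline = pd.Timestamp("2016-01-01 00:01:00.156")
later = pd.Timestamp("2016-01-01 12:00:00.425")

# Subtracting two Timestamps gives a Timedelta
delta = later - baseline

# Convert to minutes and round to the nearest whole minute
minutes = round(delta.total_seconds() / 60)

print(delta)    # 0 days 11:59:00.269000
print(minutes)  # 719
```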

Although I see how the timedelta values should be calculated (timedelta=datetime-datetime_baseline), I don't know how to have the baseline values identified (i.e. how to generate datetime_baseline column). Please, let me know if you need any further explanation.

P.S. The actual dataframe has more than 500 thousand rows.


Solution

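  • Try:

    import numpy as np

    # Earliest datetime within each (id, calendar date) group, broadcast back to every row
    df['datetime_baseline'] = df.groupby(['id', df['datetime'].dt.date])["datetime"].transform('min')
    # Elapsed time since the baseline, in minutes, rounded to the nearest minute
    df['timedelta'] = np.round((df['datetime'] - df['datetime_baseline']).dt.total_seconds() / 60)

    print(df)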
    

    Prints:

        id                datetime       datetime_baseline  timedelta
    0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156        0.0
    1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156      719.0
    2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123        0.0
    3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123      797.0
    4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147        0.0
    5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147      119.0
    6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147        0.0
    7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147      648.0
    8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147        0.0
    9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147      119.0
    10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147        0.0
    11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147      648.0
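Since the real dataframe has several hundred thousand rows, one possible variant of the groupby/transform approach above is sketched here. It groups on `dt.normalize()` (midnight timestamps) rather than `dt.date` (Python date objects), which may be faster on large frames, and it returns integer minutes to match the expected output. The function name `add_minutes_since_first` and its parameters are hypothetical, and the sketch assumes the datetime column has no missing values.

```python
import pandas as pd

def add_minutes_since_first(df, id_col="id", ts_col="datetime"):
    """Hypothetical helper: add datetime_baseline and timedelta (whole minutes)."""
    out = df.copy()
    out[ts_col] = pd.to_datetime(out[ts_col])

    # Midnight of each timestamp, used as the calendar-day grouping key
    day = out[ts_col].dt.normalize()

    # Earliest timestamp per (id, day), aligned back to the original rows
    out["datetime_baseline"] = out.groupby([id_col, day])[ts_col].transform("min")

    # Elapsed minutes since the baseline, rounded and stored as integers
    # (assumes no NaT values, which would make the integer cast fail)
    elapsed = out[ts_col] - out["datetime_baseline"]
    out["timedelta"] = (elapsed.dt.total_seconds() / 60).round().astype("int64")
    return out

# Small sample taken from the question's data (id='a1' only)
sample = pd.DataFrame({
    "id": ["a1", "a1", "a1", "a1"],
    "datetime": ["2016-01-01 00:01:00.156", "2016-01-01 12:00:00.425",
                 "2016-01-02 00:59:00.123", "2016-01-02 14:16:00.548"],
})
result = add_minutes_since_first(sample)
print(result["timedelta"].tolist())  # [0, 719, 0, 797]
```

Using `transform("min")` treats the earliest timestamp as the baseline, so the rows do not need to be sorted beforehand.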