Tags: python, pandas, group-by, timedelta

Attach a new unique ID if grouped delta > 5 seconds


I would like to attach new IDs to my current groups (the id column), based on the per-group delta, which indicates how long that id did not appear in the data frame. If the delta is more than 5 seconds, I would like to attach a new ID to that id, as shown in the "expected_id" column. Each new ID should be unique.

Any advice on how to do this?

Here is my data (df):

id  datetime                    delta       expected_id
1   2023-12-04 15:26:47.059800  1.439766    1
1   2023-12-04 15:26:48.493609  1.433809    1
1   2023-12-04 15:26:49.890176  1.396567    1
1   2023-12-04 15:26:51.274575  1.384399    1
2   2023-12-04 15:26:51.274575  169.15666   2
1   2023-12-04 15:26:52.715784  1.441209    1
1   2023-12-04 15:26:54.288126  1.572342    1
2   2023-12-04 15:26:54.288126  3.013551    2
1   2023-12-04 15:26:55.802795  1.514669    1
2   2023-12-04 15:26:55.802795  1.514669    2
1   2023-12-04 15:26:57.189969  1.387174    1
2   2023-12-04 15:26:57.189969  1.387174    2
1   2023-12-04 15:26:58.552681  1.362712    1
1   2023-12-04 15:27:00.005263  1.452582    1
2   2023-12-04 15:27:00.005263  2.815294    2
3   2023-12-04 15:27:00.005263  nan         3
1   2023-12-04 15:27:01.378002  1.372739    1
2   2023-12-04 15:27:01.378002  1.372739    2
1   2023-12-04 15:27:02.758197  1.380195    1
1   2023-12-04 15:27:04.165050  1.406853    1
1   2023-12-04 15:27:05.562082  1.397032    1
1   2023-12-04 15:27:06.953888  1.391806    1
1   2023-12-04 15:27:08.365109  1.411221    1
1   2023-12-04 15:27:09.737913  1.372804    1
1   2023-12-04 15:27:11.295586  1.557673    1
2   2023-12-04 15:27:11.295586  9.917584    4
1   2023-12-04 15:27:12.817794  1.522208    1
1   2023-12-04 15:27:14.337981  1.520187    1 
1   2023-12-04 15:27:15.811874  1.473893    1
1   2023-12-04 15:27:17.318122  1.506248    1 
2   2023-12-04 15:27:17.318122  6.022536    5
3   2023-12-04 15:27:17.318122  17.312859   6
4   2023-12-04 15:27:17.318122  nan         7
5   2023-12-04 15:27:17.318122  nan         8
6   2023-12-04 15:27:17.318122  nan         9
1   2023-12-04 15:27:18.808131  1.490009    1
2   2023-12-04 15:27:18.808131  1.490009    5
3   2023-12-04 15:27:18.808131  1.490009    6
4   2023-12-04 15:27:18.808131  1.490009    7
5   2023-12-04 15:27:18.808131  1.490009    8
6   2023-12-04 15:27:18.808131  1.490009    9
1   2023-12-04 15:27:20.189021  1.38089     1
2   2023-12-04 15:27:20.189021  1.38089     5
3   2023-12-04 15:27:20.189021  1.38089     6
4   2023-12-04 15:27:20.189021  1.38089     7
5   2023-12-04 15:27:20.189021  1.38089     8
6   2023-12-04 15:27:20.189021  1.38089     9
1   2023-12-04 15:27:21.618110  1.429089    1
1   2023-12-04 15:27:23.099324  1.481214    1
2   2023-12-04 15:27:23.099324  2.910303    5

I compute the delta column like this:

import numpy as np

# per-id time difference, converted from timedelta64 to seconds
df['delta'] = df.groupby("id")['datetime'].diff() / np.timedelta64(1, 's')
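For reference, the same conversion can also be written with `.dt.total_seconds()`; a minimal check, using two of the timestamps from the frame above:

```python
import pandas as pd

# two-row sample built from timestamps in the data above
sample = pd.DataFrame({
    "id": [1, 1],
    "datetime": pd.to_datetime([
        "2023-12-04 15:26:47.059800",
        "2023-12-04 15:26:48.493609",
    ]),
})

# .dt.total_seconds() is equivalent to dividing by np.timedelta64(1, 's')
delta = sample.groupby("id")["datetime"].diff().dt.total_seconds()
print(delta.iloc[1])  # 1.433809
```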

How to create my "expected_id" column?


Solution

  • If I understand you right, you just need to convert the id column to a new expected id. The expression below numbers each consecutive run of identical ids:

    # True wherever the id changes from the previous row; cumsum numbers the runs
    df["new_expected_id"] = (df["id"] != df["id"].shift()).cumsum() - 1
    print(df)
    

    Prints:

        id                   datetime  delta  expected_id  new_expected_id
    0    0 2023-12-04 10:51:30.158743    NaN            0                0
    1    1 2023-12-04 10:51:31.734037    NaN            1                1
    2    1 2023-12-04 10:51:33.219067   1.48            1                1
    3    1 2023-12-04 10:51:34.469723   1.25            1                1
    4    0 2023-12-04 10:51:35.862997   5.70            2                2
    5    0 2023-12-04 10:51:37.280209   1.41            2                2
    6    0 2023-12-04 10:51:38.741301   1.46            2                2
    7    0 2023-12-04 10:51:40.239296   1.49            2                2
    8    1 2023-12-04 10:51:41.590683   7.12            3                3
    9    1 2023-12-04 10:51:43.060751   1.47            3                3
    10   1 2023-12-04 10:51:44.566724   1.50            3                3
    11   1 2023-12-04 10:51:46.066713   1.49            3                3
    12   0 2023-12-04 10:51:47.493897   7.25            4                4
    13   0 2023-12-04 10:51:48.994885   1.50            4                4
    14   0 2023-12-04 10:51:50.557707   1.56            4                4
    15   0 2023-12-04 10:51:52.116537   1.55            4                4
    16   0 2023-12-04 10:51:53.642456   1.52            4                4
    17   1 2023-12-04 10:51:55.115518   9.04            5                5
    
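The shift/cumsum idiom above numbers consecutive runs of equal values; a quick sketch with a made-up id sequence:

```python
import pandas as pd

# made-up id sequence with three changes of value
ids = pd.Series([0, 1, 1, 0, 0, 1])

# True wherever a new run starts; cumsum numbers the runs, -1 makes it 0-based
run_id = (ids != ids.shift()).cumsum() - 1
print(run_id.tolist())  # [0, 1, 1, 2, 2, 3]
```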

    EDIT 2: With the new input, apply a per-id function that draws fresh IDs from a counter shared across all groups:

    import numpy as np
    import pandas as pd
    from itertools import count
    
    
    def fn(g, cnt):
        # initial unique id for this group
        x = next(cnt)
        # rows where this id was absent for 5 seconds or more
        mask = g["datetime"].diff() >= pd.Timedelta("5s")
        # draw a fresh id from the shared counter for each such row
        g.loc[mask, "expected_id"] = np.array([next(cnt) for _ in range(mask.sum())])
        # carry each new id forward within the group
        g["expected_id"] = g["expected_id"].ffill()
        # rows before the first gap keep the group's initial id
        g["expected_id"] = g["expected_id"].fillna(x).astype(int)
        return g
    
    
    c = count(1)
    out = df.groupby("id", group_keys=False, sort=False).apply(fn, cnt=c)
    print(out)
    print(out)
    

    Prints:

        id                   datetime       delta  expected_id
    0    1 2023-12-04 15:26:47.059800    1.439766            1
    1    1 2023-12-04 15:26:48.493609    1.433809            1
    2    1 2023-12-04 15:26:49.890176    1.396567            1
    3    1 2023-12-04 15:26:51.274575    1.384399            1
    4    2 2023-12-04 15:26:51.274575  169.156660            2
    5    1 2023-12-04 15:26:52.715784    1.441209            1
    6    1 2023-12-04 15:26:54.288126    1.572342            1
    7    2 2023-12-04 15:26:54.288126    3.013551            2
    8    1 2023-12-04 15:26:55.802795    1.514669            1
    9    2 2023-12-04 15:26:55.802795    1.514669            2
    10   1 2023-12-04 15:26:57.189969    1.387174            1
    11   2 2023-12-04 15:26:57.189969    1.387174            2
    12   1 2023-12-04 15:26:58.552681    1.362712            1
    13   1 2023-12-04 15:27:00.005263    1.452582            1
    14   2 2023-12-04 15:27:00.005263    2.815294            2
    15   3 2023-12-04 15:27:00.005263         NaN            3
    16   1 2023-12-04 15:27:01.378002    1.372739            1
    17   2 2023-12-04 15:27:01.378002    1.372739            2
    18   1 2023-12-04 15:27:02.758197    1.380195            1
    19   1 2023-12-04 15:27:04.165050    1.406853            1
    20   1 2023-12-04 15:27:05.562082    1.397032            1
    21   1 2023-12-04 15:27:06.953888    1.391806            1
    22   1 2023-12-04 15:27:08.365109    1.411221            1
    23   1 2023-12-04 15:27:09.737913    1.372804            1
    24   1 2023-12-04 15:27:11.295586    1.557673            1
    25   2 2023-12-04 15:27:11.295586    9.917584            3
    26   1 2023-12-04 15:27:12.817794    1.522208            1
    27   1 2023-12-04 15:27:14.337981    1.520187            1
    28   1 2023-12-04 15:27:15.811874    1.473893            1
    29   1 2023-12-04 15:27:17.318122    1.506248            1
    30   2 2023-12-04 15:27:17.318122    6.022536            4
    31   3 2023-12-04 15:27:17.318122   17.312859            6
    32   4 2023-12-04 15:27:17.318122         NaN            7
    33   5 2023-12-04 15:27:17.318122         NaN            8
    34   6 2023-12-04 15:27:17.318122         NaN            9
    35   1 2023-12-04 15:27:18.808131    1.490009            1
    36   2 2023-12-04 15:27:18.808131    1.490009            5
    37   3 2023-12-04 15:27:18.808131    1.490009            6
    38   4 2023-12-04 15:27:18.808131    1.490009            7
    39   5 2023-12-04 15:27:18.808131    1.490009            8
    40   6 2023-12-04 15:27:18.808131    1.490009            9
    41   1 2023-12-04 15:27:20.189021    1.380890            1
    42   2 2023-12-04 15:27:20.189021    1.380890            5
    43   3 2023-12-04 15:27:20.189021    1.380890            6
    44   4 2023-12-04 15:27:20.189021    1.380890            7
    45   5 2023-12-04 15:27:20.189021    1.380890            8
    46   6 2023-12-04 15:27:20.189021    1.380890            9
    47   1 2023-12-04 15:27:21.618110    1.429089            1
    48   1 2023-12-04 15:27:23.099324    1.481214            1
    49   2 2023-12-04 15:27:23.099324    2.910303            5
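To see the counter-based approach in isolation, here is a self-contained sketch with made-up timestamps: a single id that goes quiet for more than 5 seconds and then returns, so its later rows should receive a fresh unique id:

```python
import numpy as np
import pandas as pd
from itertools import count


def fn(g, cnt):
    # initial unique id for this group
    x = next(cnt)
    # rows where this id was absent for 5 seconds or more
    mask = g["datetime"].diff() >= pd.Timedelta("5s")
    # draw a fresh id from the shared counter for each such row
    g.loc[mask, "expected_id"] = np.array([next(cnt) for _ in range(mask.sum())])
    g["expected_id"] = g["expected_id"].ffill()
    g["expected_id"] = g["expected_id"].fillna(x).astype(int)
    return g


# made-up data: one id, with a 9-second gap between rows 2 and 3
df = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "datetime": pd.to_datetime([
        "2023-12-04 15:00:00",
        "2023-12-04 15:00:01",
        "2023-12-04 15:00:10",   # 9 s gap -> new unique id from here on
        "2023-12-04 15:00:11",
    ]),
})

c = count(1)
out = df.groupby("id", group_keys=False, sort=False).apply(fn, cnt=c)
print(out["expected_id"].tolist())  # [1, 1, 2, 2]
```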