Search code examples
pandasdataframenumpyrow

How to get the latest row at specific time intervals to form a dataframe?


Supposed 3 kids are having a contest to see who can sell the most candies, chocolate bars and cookies over a couple of days. They started their contest that very day at 08:15:00 (8.15am) and agreed to enter their sale into a tracker as shown in the data frame below:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
    'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
    'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
    'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
    'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
})
    
    Name    Timestamp   Candy   Chocolate Bars  Cookies
0   Harvey  2022-01-01 08:17:23.12  2   4   1
1   Khala   2022-01-01 08:22:58.76  1   NaN 1
2   Gaddy   2022-01-01 08:19:02.57  3   6   4
3   Harvey  2022-01-01 08:55:43.99  3   7   2
4   Khala   2022-01-01 08:41:23.10  5   8   4
5   Gaddy   2022-01-01 09:14:59.99  4   6   5
6   Harvey  2022-01-01 09:15:02.02  6   7   5
7   Khala   2022-01-01 09:44:43.30  6   13  8
8   Gaddy   2022-01-01 09:54:23.71  4   10  11
9   Khala   2022-01-01 10:15:00.00  10  19  8
10  Harvey  2022-01-01 10:15:02.99  9   11  15
11  Gaddy   2022-01-01 10:19:43.52  14  11  17

​Now the intention is to create a new data frame that captures each child's latest sale in 1-hour intervals (an example of an hour's window would be 08:15:00.00 - 09:14:59.99) and the window they were captured in. Such that the data frame will look like this:

    Name    Window  Timestamp   Candy   Chocolate Bars  Cookies
1   Harvey  09:15:00.00 2022-01-01 08:55:43.99  3   7   2
2   Khala   09:15:00.00 2022-01-01 08:41:23.10  5   8   4
3   Gaddy   09:15:00.00 2022-01-01 09:14:59.99  4   6   5
4   Harvey  10:15:00.00 2022-01-01 09:15:02.02  6   7   5
5   Khala   10:15:00.00 2022-01-01 09:44:43.30  6   13  8
6   Gaddy   10:15:00.00 2022-01-01 09:54:23.71  4   10  11
7   Khala   11:15:00.00 2022-01-01 10:15:00.00  10  19  8
8   Harvey  11:15:00.00 2022-01-01 10:15:02.99  9   11  15
9   Gaddy   11:15:00.00 2022-01-01 10:19:43.52  14  11  17

Solution

  • The first thing I would do would be to convert the timestamp column to datetime to make it more easy to work with

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({
        'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
        'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
        'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
        'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
        'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
    })
    df["Timestamp"] = pd.to_datetime(df["Timestamp"]) 
    

    Then the next step would be to add the window column

    # Get window
    window_start = pd.to_timedelta("15min")
    df["Window"] = (df["Timestamp"] - window_start).dt.floor("1h") + window_start
    

    Which you can do by first shifting the times by 15 minutes take only the hours and then add back the 15 minutes. If you don't want to keep the date in the window that is possible as well.

    The last step is to sort the timestamps and only keep one per window and person

    # Keep only one row per window and person
    df = df.sort_values("Timestamp", ascending=False).groupby(["Name", "Window"]).head(1)
    df = df.sort_index().reset_index(drop=True)