
Efficiently remove rows from pandas df based on second latest time in column


I have a pandas DataFrame that looks similar to this:

Index  ID   time_1               time_2
0      101  2024-06-20 14:32:22  2024-06-20 14:10:31
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
5      103  2024-06-20 20:05:44  2024-06-20 21:17:35
6      103  2024-06-20 22:41:22  2024-06-20 22:21:31
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

For each ID in my df I want to take the second latest time_1 (if it exists). I then want to compare this time with the timestamps in time_2 and remove all rows from my df where time_2 is earlier than this time. My expected output would be:

Index  ID   time_1               time_2
1      101  2024-06-20 15:21:31  2024-06-20 14:32:22
2      101  2024-06-20 15:21:31  2024-06-20 15:21:31
3      102  2024-06-20 16:26:51  2024-06-20 15:21:31
4      102  2024-06-20 16:26:51  2024-06-20 16:56:24
7      103  2024-06-20 23:11:56  2024-06-20 23:01:31

This problem is above my pandas level. I asked ChatGPT, and this is the solution I got, which in principle does what I want:

import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']


df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

mask = pd.Series(False, index=df.index)

for id_value, group in df.groupby('id'):
    # Remove duplicates and sort timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()

    # Check if there's more than one unique time
    if len(unique_sorted_times) > 1:
        # Select the second last time
        second_last_time = unique_sorted_times.iloc[-2]
        # Update the mask for rows with time_2 greater than or equal to the second last time_1
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # If there's only one unique time, keep the row(s)
        mask |= (df['id'] == id_value)

filtered_data = df[mask]

My issue with this solution is the for-loop: it seems rather inefficient, and my real data is quite large. Is there a better, more efficient way to do this?


Solution

  • You can use .transform() to create the mask.

    Sorting is not necessary when you can just use .nlargest() and select the second value if it exists. Or, if time_1 is already sorted, you can skip .nlargest() (and sorting) entirely; a sketch of that shortcut is shown further below.

    Then you just need to replace NaT with the smallest possible Timestamp value so that time_2 can't be earlier than it when you do the comparison.

    # Per id: second largest unique time_1 (empty slice -> NaT after transform)
    second_last_times = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().nlargest(2).iloc[1:].squeeze())
    # Keep rows where time_2 is not earlier; NaT -> Timestamp.min keeps the whole group
    mask = second_last_times.fillna(pd.Timestamp.min).le(df['time_2'])
    df[mask]
    

    Result:

        id              time_1              time_2
    1  101 2024-06-20 15:21:31 2024-06-20 14:32:22
    2  101 2024-06-20 15:21:31 2024-06-20 15:21:31
    3  102 2024-06-20 16:26:51 2024-06-20 15:21:31
    4  102 2024-06-20 16:26:51 2024-06-20 16:56:24
    7  103 2024-06-20 23:11:56 2024-06-20 23:01:31
    

    For reference, second_last_times:

    0   2024-06-20 14:32:22
    1   2024-06-20 14:32:22
    2   2024-06-20 14:32:22
    3                   NaT
    4                   NaT
    5   2024-06-20 22:41:22
    6   2024-06-20 22:41:22
    7   2024-06-20 22:41:22
    Name: time_1, dtype: datetime64[ns]
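
    As a rough sketch of the sorted-time_1 shortcut mentioned above: assuming time_1 is already sorted in ascending order within each id, the deduplicated values end with the largest ones, so a plain slice can stand in for .nlargest():

    # Assumes time_1 is already sorted ascending within each id,
    # so the second-to-last unique value is the second latest.
    second_last_times = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().iloc[-2:-1].squeeze())
    mask = second_last_times.fillna(pd.Timestamp.min).le(df['time_2'])
    df[mask]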
    

    If you want to generalize this, replace .nlargest(2).iloc[1:] with .nlargest(n).iloc[n-1:].
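
    For instance, a minimal sketch of that generalization (n is a hypothetical parameter and nth_latest_times is just an illustrative name; n=2 reproduces the result above):

    n = 2  # n-th latest unique time_1 per id; n=2 gives the second latest
    nth_latest_times = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().nlargest(n).iloc[n - 1:].squeeze())
    mask = nth_latest_times.fillna(pd.Timestamp.min).le(df['time_2'])
    df[mask]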


    P.S. This is similar to mozway's solution, but I had actually written the code before they posted, except for the .squeeze() technique - thanks for that.