I have a pandas DataFrame that looks similar to this:
Index | ID | time_1 | time_2 |
---|---|---|---|
0 | 101 | 2024-06-20 14:32:22 | 2024-06-20 14:10:31 |
1 | 101 | 2024-06-20 15:21:31 | 2024-06-20 14:32:22 |
2 | 101 | 2024-06-20 15:21:31 | 2024-06-20 15:21:31 |
3 | 102 | 2024-06-20 16:26:51 | 2024-06-20 15:21:31 |
4 | 102 | 2024-06-20 16:26:51 | 2024-06-20 16:56:24 |
5 | 103 | 2024-06-20 20:05:44 | 2024-06-20 21:17:35 |
6 | 103 | 2024-06-20 22:41:22 | 2024-06-20 22:21:31 |
7 | 103 | 2024-06-20 23:11:56 | 2024-06-20 23:01:31 |
For each ID in my df I want to take the second latest time_1 (if it exists). I then want to compare this time with the timestamps in time_2 and remove all rows from my df where time_2 is earlier than this time. My expected output would be:
Index | ID | time_1 | time_2 |
---|---|---|---|
1 | 101 | 2024-06-20 15:21:31 | 2024-06-20 14:32:22 |
2 | 101 | 2024-06-20 15:21:31 | 2024-06-20 15:21:31 |
3 | 102 | 2024-06-20 16:26:51 | 2024-06-20 15:21:31 |
4 | 102 | 2024-06-20 16:26:51 | 2024-06-20 16:56:24 |
7 | 103 | 2024-06-20 23:11:56 | 2024-06-20 23:01:31 |
This problem is above my pandas level. I asked ChatGPT, and this is the solution it gave me, which in principle does what I want:
import pandas as pd

ids = [101, 101, 101, 102, 102, 103, 103, 103]
time_1 = ['2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:26:51', '2024-06-20 16:26:51', '2024-06-20 20:05:44', '2024-06-20 22:41:22', '2024-06-20 23:11:56']
time_2 = ['2024-06-20 14:10:31', '2024-06-20 14:32:22', '2024-06-20 15:21:31', '2024-06-20 15:21:31', '2024-06-20 16:56:24', '2024-06-20 21:17:35', '2024-06-20 22:21:31', '2024-06-20 23:01:31']

df = pd.DataFrame({
    'id': ids,
    'time_1': pd.to_datetime(time_1),
    'time_2': pd.to_datetime(time_2)
})

mask = pd.Series(False, index=df.index)
for id_value, group in df.groupby('id'):
    # Remove duplicates and sort the timestamps
    unique_sorted_times = group['time_1'].drop_duplicates().sort_values()
    # Check if there's more than one unique time
    if len(unique_sorted_times) > 1:
        # Select the second-last time
        second_last_time = unique_sorted_times.iloc[-2]
        # Update the mask for rows with time_2 greater than or equal to the second-last time_1
        mask |= (df['id'] == id_value) & (df['time_2'] >= second_last_time)
    else:
        # If there's only one unique time, keep the row(s)
        mask |= (df['id'] == id_value)

filtered_data = df[mask]
My issue with this solution is the for loop, which seems rather inefficient, and my real data is quite large. Is there a better, more efficient solution for this?
You can use `.transform()` to create the mask. Sorting is not necessary when you can just use `.nlargest()` and select the second value if it exists. Or, if `time_1` is already sorted, you can skip `.nlargest()` (or sorting) entirely. Then you just need to replace `NaT` with the smallest possible `Timestamp` value so that `time_2` can't be earlier than it when you do the comparison.
# Second-largest unique time_1 per id, broadcast back to every row
# (NaT where a group has only one unique time).
second_last_times = df.groupby('id')['time_1'].transform(
    lambda s: s.drop_duplicates().nlargest(2).iloc[1:].squeeze())
# Fill NaT with the smallest possible Timestamp so those rows always pass.
mask = second_last_times.fillna(pd.Timestamp.min).le(df['time_2'])
df[mask]
Result:
id time_1 time_2
1 101 2024-06-20 15:21:31 2024-06-20 14:32:22
2 101 2024-06-20 15:21:31 2024-06-20 15:21:31
3 102 2024-06-20 16:26:51 2024-06-20 15:21:31
4 102 2024-06-20 16:26:51 2024-06-20 16:56:24
7 103 2024-06-20 23:11:56 2024-06-20 23:01:31
For reference, `second_last_times`:
0 2024-06-20 14:32:22
1 2024-06-20 14:32:22
2 2024-06-20 14:32:22
3 NaT
4 NaT
5 2024-06-20 22:41:22
6 2024-06-20 22:41:22
7 2024-06-20 22:41:22
Name: time_1, dtype: datetime64[ns]
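By the way, the shortcut mentioned above - skipping `.nlargest()` when `time_1` is already sorted - could look like this sketch (assuming ascending order within each `id`; otherwise the result would be wrong):

# Sketch, assuming time_1 is already sorted ascending within each id:
# after drop_duplicates() the second-largest unique value is simply the
# second-to-last element; an empty slice still becomes NaT via transform.
second_last_times = df.groupby('id')['time_1'].transform(
    lambda s: s.drop_duplicates().iloc[-2:-1].squeeze())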
If you want to generalize this, replace `.nlargest(2).iloc[1:]` with `.nlargest(n).iloc[n-1:]`.
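As a rough sketch of that generalization (the helper name `keep_nth_largest_cutoff` is illustrative, not an existing API):

# Illustrative helper: keep rows whose time_2 is not earlier than the
# n-th largest unique time_1 of their id group; groups with fewer than
# n unique times keep all of their rows.
def keep_nth_largest_cutoff(df, n):
    cutoff = df.groupby('id')['time_1'].transform(
        lambda s: s.drop_duplicates().nlargest(n).iloc[n - 1:].squeeze())
    return df[cutoff.fillna(pd.Timestamp.min).le(df['time_2'])]

Here, `keep_nth_largest_cutoff(df, 2)` reproduces the result above.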
P.S. This is similar to mozway's solution, but I actually wrote the code before they posted, except for the `squeeze` technique - thanks for that.