I have timestamps of when a product was viewed, and I want to remove duplicates: views of the same product that occur within ten minutes after the previous non-duplicate timestamp (these can be due to the user refreshing the page). I also want to preserve the metadata associated with each view.
To be very clear: if a specific product is first viewed at time = 25 min and then again at time = 30 min, the second instance is considered a duplicate and should be removed. If there is a further view of the same product at time = 36 min, this new view is more than ten min after the previous non-duplicate view (at 25 min), and should therefore be kept.
from datetime import datetime
import polars as pl
df = pl.DataFrame(
    {
        "created_time": [
            datetime(2023, 1, 1, 0, 0),
            datetime(2023, 1, 1, 0, 1),
            datetime(2023, 1, 1, 0, 2),
            datetime(2023, 1, 1, 0, 3),
            datetime(2023, 1, 1, 0, 11),
            datetime(2023, 1, 1, 0, 29),
            datetime(2023, 1, 1, 0, 31),
        ],
        "product_id": [1, 1, 2, 1, 1, 1, 1],
        "metadata": ["a", "b", "c", "d", "e", "f", "g"],
    }
)
df_desirable = pl.DataFrame(
    {
        "created_time": [
            datetime(2023, 1, 1, 0, 0),
            datetime(2023, 1, 1, 0, 2),
            datetime(2023, 1, 1, 0, 11),
            datetime(2023, 1, 1, 0, 29),
        ],
        "product_id": [1, 2, 1, 1],
        "metadata": ["a", "c", "e", "f"],
    }
)
Raw data
print(df)
shape: (7, 3)
┌─────────────────────┬────────────┬──────────┐
│ created_time ┆ product_id ┆ metadata │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ str │
╞═════════════════════╪════════════╪══════════╡
│ 2023-01-01 00:00:00 ┆ 1 ┆ a │
│ 2023-01-01 00:01:00 ┆ 1 ┆ b │
│ 2023-01-01 00:02:00 ┆ 2 ┆ c │
│ 2023-01-01 00:03:00 ┆ 1 ┆ d │
│ 2023-01-01 00:11:00 ┆ 1 ┆ e │
│ 2023-01-01 00:29:00 ┆ 1 ┆ f │
│ 2023-01-01 00:31:00 ┆ 1 ┆ g │
└─────────────────────┴────────────┴──────────┘
Desired output
print(df_desirable)
shape: (4, 3)
┌─────────────────────┬────────────┬──────────┐
│ created_time ┆ product_id ┆ metadata │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ str │
╞═════════════════════╪════════════╪══════════╡
│ 2023-01-01 00:00:00 ┆ 1 ┆ a │
│ 2023-01-01 00:02:00 ┆ 2 ┆ c │
│ 2023-01-01 00:11:00 ┆ 1 ┆ e │
│ 2023-01-01 00:29:00 ┆ 1 ┆ f │
└─────────────────────┴────────────┴──────────┘
In the above raw data, the first row is not a duplicate, because there are no timestamps with product_id = 1 within 10 minutes before it (there are no rows at all before it). Naming rows by their metadata now: b is a duplicate because it occurs within 10 min of a. c is a new product_id and not a duplicate. d is a duplicate, being within 10 min after a. e is not a duplicate, because it is more than 10 min after a. f is not a duplicate, because it is more than 10 min after e. g is a duplicate, because it is less than 10 min after f, which is the previous non-duplicate.
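For reference, here is a plain-Python sketch of the rule (assuming rows are pre-sorted by created_time; dedup_reference is a hypothetical helper, not part of my pipeline) - I'm looking for an idiomatic polars way to do the same thing:
from datetime import timedelta

def dedup_reference(rows, window=timedelta(minutes=10)):
    last_kept = {}  # product_id -> created_time of the last kept (non-duplicate) view
    kept = []
    for row in rows:  # rows must be sorted by created_time
        pid, t = row["product_id"], row["created_time"]
        # keep the first view of a product, or any view more than
        # `window` after the previous kept view of the same product
        if pid not in last_kept or t - last_kept[pid] > window:
            last_kept[pid] = t
            kept.append(row)
    return kept

print([r["metadata"] for r in dedup_reference(df.sort("created_time").to_dicts())])
# ['a', 'c', 'e', 'f']
(Whether a view at exactly ten minutes counts as a duplicate is a boundary case; the strict > above drops it.)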
I've considered grouping by the product_id, but then I'm not sure how to correctly remove the entries for product_id == 1 at minutes 1 and 3 in the example above.
I've also considered using something like:
from datetime import timedelta
df.group_by_dynamic("created_time", every=timedelta(minutes=10))
but I'm not sure how to remove the rows correctly here either.
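To illustrate the problem with fixed windows (a sketch, assuming a recent polars version where group_by_dynamic accepts a group_by argument): the buckets are aligned to the clock rather than restarting at each kept view, so g at 00:31 falls into a fresh 00:30 bucket and survives, even though it is only 2 minutes after f.
(
    df.sort("created_time")
    .group_by_dynamic("created_time", every="10m", group_by="product_id")
    .agg(pl.first("metadata"))
)
# product 1 gets buckets starting at 00:00, 00:10, 00:20 and 00:30,
# keeping a, e, f and also g - but g should be removed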
Could someone help?
Not sure if I'm over-complicating things here - but it looks like a .join_asof type problem?
(the created_time column must be sorted to use join_asof)
# df = df.sort("created_time")
df.join_asof(
    # right frame: each view plus the end of its 10-minute window
    df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
    left_on = "created_time",
    right_on = "window",
    by = "product_id",       # match within the same product only
    strategy = "forward"     # first right row whose window ends at/after this view
)
shape: (7, 6)
┌─────────────────────┬────────────┬──────────┬─────────────────────┬────────────────┬─────────────────────┐
│ created_time        ┆ product_id ┆ metadata ┆ created_time_right  ┆ metadata_right ┆ window              │
│ ---                 ┆ ---        ┆ ---      ┆ ---                 ┆ ---            ┆ ---                 │
│ datetime[μs]        ┆ i64        ┆ str      ┆ datetime[μs]        ┆ str            ┆ datetime[μs]        │
╞═════════════════════╪════════════╪══════════╪═════════════════════╪════════════════╪═════════════════════╡
│ 2023-01-01 00:00:00 ┆ 1          ┆ a        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
│ 2023-01-01 00:01:00 ┆ 1          ┆ b        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
│ 2023-01-01 00:02:00 ┆ 2          ┆ c        ┆ 2023-01-01 00:02:00 ┆ c              ┆ 2023-01-01 00:12:00 │
│ 2023-01-01 00:03:00 ┆ 1          ┆ d        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
│ 2023-01-01 00:11:00 ┆ 1          ┆ e        ┆ 2023-01-01 00:01:00 ┆ b              ┆ 2023-01-01 00:11:00 │
│ 2023-01-01 00:29:00 ┆ 1          ┆ f        ┆ 2023-01-01 00:29:00 ┆ f              ┆ 2023-01-01 00:39:00 │
│ 2023-01-01 00:31:00 ┆ 1          ┆ g        ┆ 2023-01-01 00:29:00 ┆ f              ┆ 2023-01-01 00:39:00 │
└─────────────────────┴────────────┴──────────┴─────────────────────┴────────────────┴─────────────────────┘
You can then .group_by("product_id", "window") and take the .first() row from each group.
(df
.join_asof(
df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
left_on = "created_time",
right_on = "window",
by = "product_id",
strategy = "forward"
)
.group_by("product_id", "window")
.first()
)
shape: (4, 6)
┌────────────┬─────────────────────┬─────────────────────┬──────────┬─────────────────────┬────────────────┐
│ product_id ┆ window              ┆ created_time        ┆ metadata ┆ created_time_right  ┆ metadata_right │
│ ---        ┆ ---                 ┆ ---                 ┆ ---      ┆ ---                 ┆ ---            │
│ i64        ┆ datetime[μs]        ┆ datetime[μs]        ┆ str      ┆ datetime[μs]        ┆ str            │
╞════════════╪═════════════════════╪═════════════════════╪══════════╪═════════════════════╪════════════════╡
│ 1          ┆ 2023-01-01 00:10:00 ┆ 2023-01-01 00:00:00 ┆ a        ┆ 2023-01-01 00:00:00 ┆ a              │
│ 1          ┆ 2023-01-01 00:11:00 ┆ 2023-01-01 00:11:00 ┆ e        ┆ 2023-01-01 00:01:00 ┆ b              │
│ 1          ┆ 2023-01-01 00:39:00 ┆ 2023-01-01 00:29:00 ┆ f        ┆ 2023-01-01 00:29:00 ┆ f              │
│ 2          ┆ 2023-01-01 00:12:00 ┆ 2023-01-01 00:02:00 ┆ c        ┆ 2023-01-01 00:02:00 ┆ c              │
└────────────┴─────────────────────┴─────────────────────┴──────────┴─────────────────────┴────────────────┘
You can add maintain_order=True to the group_by if you want to keep the original order.
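Putting it together (a sketch; the final .select just drops the helper columns shown in the output above):
(df
    .join_asof(
        df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
        left_on = "created_time",
        right_on = "window",
        by = "product_id",
        strategy = "forward"
    )
    .group_by("product_id", "window", maintain_order=True)
    .first()
    .select("created_time", "product_id", "metadata")
)
# shape: (4, 3) - same rows as df_desirable, in the original time order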