python-polars

Removing duplicates that occur within a timedelta


I have timestamps of when a product was viewed, and I am interested in removing duplicates that occur within ten minutes of the previous non-duplicate timestamp for the same product (these can be due to the user refreshing the page). I also want to preserve the metadata associated with each product.

To be very clear: if a specific product is first viewed at time = 25 min, and then again at time = 30 min, the second instance is considered a duplicate and should be removed. If there is a further view of the same product at time = 36 min, then this new view is more than ten minutes after the previous non-duplicate view, and should therefore be kept.

from datetime import datetime
import polars as pl

df = pl.DataFrame(
    {
        "created_time": [
            datetime(2023, 1, 1, 0, 0),
            datetime(2023, 1, 1, 0, 1),
            datetime(2023, 1, 1, 0, 2),
            datetime(2023, 1, 1, 0, 3),
            datetime(2023, 1, 1, 0, 11),
            datetime(2023, 1, 1, 0, 29),
            datetime(2023, 1, 1, 0, 31),
        ],
        "product_id": [1, 1, 2, 1, 1, 1, 1],
        "metadata":["a", "b", "c", "d", "e", "f", "g"]
    }
)


df_desirable = pl.DataFrame(
    {
        "created_time": [
            datetime(2023, 1, 1, 0, 0),
            datetime(2023, 1, 1, 0, 2),
            datetime(2023, 1, 1, 0, 11),
            datetime(2023, 1, 1, 0, 29),
        ],
        "product_id": [1, 2, 1, 1],
        "metadata":["a", "c", "e", "f"]
    }
)

Raw data

print(df)
shape: (7, 3)
┌─────────────────────┬────────────┬──────────┐
│ created_time        ┆ product_id ┆ metadata │
│ ---                 ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ i64        ┆ str      │
╞═════════════════════╪════════════╪══════════╡
│ 2023-01-01 00:00:00 ┆ 1          ┆ a        │
│ 2023-01-01 00:01:00 ┆ 1          ┆ b        │
│ 2023-01-01 00:02:00 ┆ 2          ┆ c        │
│ 2023-01-01 00:03:00 ┆ 1          ┆ d        │
│ 2023-01-01 00:11:00 ┆ 1          ┆ e        │
│ 2023-01-01 00:29:00 ┆ 1          ┆ f        │
│ 2023-01-01 00:31:00 ┆ 1          ┆ g        │
└─────────────────────┴────────────┴──────────┘

Desired output

print(df_desirable)
shape: (4, 3)
┌─────────────────────┬────────────┬──────────┐
│ created_time        ┆ product_id ┆ metadata │
│ ---                 ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ i64        ┆ str      │
╞═════════════════════╪════════════╪══════════╡
│ 2023-01-01 00:00:00 ┆ 1          ┆ a        │
│ 2023-01-01 00:02:00 ┆ 2          ┆ c        │
│ 2023-01-01 00:11:00 ┆ 1          ┆ e        │
│ 2023-01-01 00:29:00 ┆ 1          ┆ f        │
└─────────────────────┴────────────┴──────────┘

In the raw data above, the first row is not a duplicate, because there is no view of product_id = 1 within the 10 minutes before it (there are no rows at all before it). Referring to rows by their metadata from here on: b is a duplicate because it occurs within 10 min of a. c is a new product_id and not a duplicate. d is a duplicate, being within 10 min after a. e is not a duplicate, because it is more than 10 min after a. f is not a duplicate, because it is more than 10 min after e. g is a duplicate, because it is less than 10 min after f, which is the previous non-duplicate.
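
For reference, here is a naive Python loop that produces the output I want (the helper name dedupe_sequential is just for illustration, and it assumes the frame is already sorted by created_time). I'd like a more idiomatic polars way of doing this:

from datetime import datetime, timedelta

def dedupe_sequential(frame: pl.DataFrame, window: timedelta = timedelta(minutes=10)) -> pl.DataFrame:
    # maps product_id -> timestamp of its last non-duplicate view
    last_kept: dict[int, datetime] = {}
    keep = []
    for row in frame.iter_rows(named=True):
        prev = last_kept.get(row["product_id"])
        # keep the first view of a product, or any view that is more than
        # `window` after the previous *kept* view of the same product
        if prev is None or row["created_time"] - prev > window:
            last_kept[row["product_id"]] = row["created_time"]
            keep.append(True)
        else:
            keep.append(False)
    return frame.filter(pl.Series(keep))

print(dedupe_sequential(df))  # matches df_desirable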

I've considered grouping by the product_id, but then I'm not sure how to correctly remove the entries for product_id == 1 at minutes 1 and 3 in the example above.

I've also considered using something like:

from datetime import timedelta
df.group_by_dynamic("created_time", every=timedelta(minutes=10))

but I'm not sure how to remove the rows correctly here either.
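
One concrete problem I see with this approach (a quick illustration; the group_by parameter and set_sorted here assume a recent polars version): the 10-minute buckets are aligned to a fixed grid rather than to the previous kept view, so keeping the first row per bucket gives the wrong answer.

# illustration: epoch-aligned 10-minute buckets per product
# f (00:29) lands in [00:20, 00:30) while g (00:31) lands in [00:30, 00:40),
# so taking the first row of each bucket would wrongly keep g
print(
    df.set_sorted("created_time")
      .group_by_dynamic("created_time", every="10m", group_by="product_id")
      .agg(pl.col("metadata"))
)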

Could someone help?


Solution

  • Not sure if I'm over-complicating things here, but it looks like a .join_asof type problem?

    # note: the created_time column must be sorted to use join_asof
    # df = df.sort("created_time")
    
    df.join_asof(
       df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
       left_on = "created_time",
       right_on = "window",
       by = "product_id",
       strategy = "forward"
    )
    
    shape: (7, 6)
    ┌─────────────────────┬────────────┬──────────┬─────────────────────┬────────────────┬─────────────────────┐
    │ created_time        ┆ product_id ┆ metadata ┆ created_time_right  ┆ metadata_right ┆ window              │
    │ ---                 ┆ ---        ┆ ---      ┆ ---                 ┆ ---            ┆ ---                 │
    │ datetime[μs]        ┆ i64        ┆ str      ┆ datetime[μs]        ┆ str            ┆ datetime[μs]        │
    ╞═════════════════════╪════════════╪══════════╪═════════════════════╪════════════════╪═════════════════════╡
    │ 2023-01-01 00:00:00 ┆ 1          ┆ a        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
    │ 2023-01-01 00:01:00 ┆ 1          ┆ b        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
    │ 2023-01-01 00:02:00 ┆ 2          ┆ c        ┆ 2023-01-01 00:02:00 ┆ c              ┆ 2023-01-01 00:12:00 │
    │ 2023-01-01 00:03:00 ┆ 1          ┆ d        ┆ 2023-01-01 00:00:00 ┆ a              ┆ 2023-01-01 00:10:00 │
    │ 2023-01-01 00:11:00 ┆ 1          ┆ e        ┆ 2023-01-01 00:01:00 ┆ b              ┆ 2023-01-01 00:11:00 │
    │ 2023-01-01 00:29:00 ┆ 1          ┆ f        ┆ 2023-01-01 00:29:00 ┆ f              ┆ 2023-01-01 00:39:00 │
    │ 2023-01-01 00:31:00 ┆ 1          ┆ g        ┆ 2023-01-01 00:29:00 ┆ f              ┆ 2023-01-01 00:39:00 │
    └─────────────────────┴────────────┴──────────┴─────────────────────┴────────────────┴─────────────────────┘
    

    You can then .group_by("product_id", "window") and take the .first() row from each group.

    (df
      .join_asof(
         df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
         left_on = "created_time",
         right_on = "window",
         by = "product_id",
         strategy = "forward"
      )
      .group_by("product_id", "window")
      .first()
    )
    
    shape: (4, 6)
    ┌────────────┬─────────────────────┬─────────────────────┬──────────┬─────────────────────┬────────────────┐
    │ product_id ┆ window              ┆ created_time        ┆ metadata ┆ created_time_right  ┆ metadata_right │
    │ ---        ┆ ---                 ┆ ---                 ┆ ---      ┆ ---                 ┆ ---            │
    │ i64        ┆ datetime[μs]        ┆ datetime[μs]        ┆ str      ┆ datetime[μs]        ┆ str            │
    ╞════════════╪═════════════════════╪═════════════════════╪══════════╪═════════════════════╪════════════════╡
    │ 1          ┆ 2023-01-01 00:10:00 ┆ 2023-01-01 00:00:00 ┆ a        ┆ 2023-01-01 00:00:00 ┆ a              │
    │ 1          ┆ 2023-01-01 00:11:00 ┆ 2023-01-01 00:11:00 ┆ e        ┆ 2023-01-01 00:01:00 ┆ b              │
    │ 1          ┆ 2023-01-01 00:39:00 ┆ 2023-01-01 00:29:00 ┆ f        ┆ 2023-01-01 00:29:00 ┆ f              │
    │ 2          ┆ 2023-01-01 00:12:00 ┆ 2023-01-01 00:02:00 ┆ c        ┆ 2023-01-01 00:02:00 ┆ c              │
    └────────────┴─────────────────────┴─────────────────────┴──────────┴─────────────────────┴────────────────┘
    

    You can add maintain_order=True to the group_by if you want to keep the original order.
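
    For completeness, here is the whole pipeline in one go. It is a sketch: the trailing .select and .sort are extra tidy-up steps (not part of the answer above) that drop the helper columns and restore the original column and row order, so the result lines up with df_desirable.

    result = (
        df.join_asof(
            df.with_columns(window = pl.col("created_time") + pl.duration(minutes=10)),
            left_on = "created_time",
            right_on = "window",
            by = "product_id",
            strategy = "forward"
        )
        .group_by("product_id", "window", maintain_order=True)
        .first()
        .select("created_time", "product_id", "metadata")  # drop the helper columns
        .sort("created_time")
    )
    print(result)  # same rows as df_desirable: a, c, e, f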