Tags: python, dataframe, python-polars

Combine cross join between 2 dataframes efficiently


I am working with 2 datasets. One describes some time windows by their start and stop times. The second one contains a big list of events with their corresponding timestamps.

I want to combine these into a single dataframe that contains the start and stop time of each window, together with how many events of each name occurred during that window.

I have managed to "solve" my problem with:

import polars as pl

actions = {
    "id": ["a", "a", "a", "a", "b", "b", "a", "a"],
    "action": ["start", "stop", "start", "stop", "start", "stop", "start", "stop"],
    "time": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}

events = {
    "name": ["x", "x", "x", "y", "y", "z", "w", "w", "w"],
    "time": [0.0, 0.1, 0.5, 1.1, 2.5, 3.0, 4.5, 4.9, 5.5],
}

actions_df = (
    pl.DataFrame(actions)
    .group_by("id")
    .agg(
        start=pl.col("time").filter(pl.col("action") == "start"),
        stop=pl.col("time").filter(pl.col("action") == "stop"),
    )
    .explode(["start", "stop"])
)

df = (
    actions_df.join(pl.DataFrame(events), how="cross")
    .filter((pl.col("time") >= pl.col("start")) & (pl.col("time") <= pl.col("stop")))
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.count("name"))
    .pivot("name", index=["id", "start", "stop"], values="count")
    .fill_null(0)
)

result_df = (
    actions_df.join(df, on=["id", "start", "stop"], how="left")
    .fill_null(0)
    .sort("start")
)

print(result_df)
"""
┌─────┬───────┬──────┬─────┬─────┬─────┬─────┐
│ id  ┆ start ┆ stop ┆ w   ┆ y   ┆ x   ┆ z   │
│ --- ┆ ---   ┆ ---  ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64   ┆ f64  ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═══════╪══════╪═════╪═════╪═════╪═════╡
│ a   ┆ 0.0   ┆ 1.0  ┆ 0   ┆ 0   ┆ 3   ┆ 0   │
│ a   ┆ 2.0   ┆ 3.0  ┆ 0   ┆ 1   ┆ 0   ┆ 1   │
│ b   ┆ 4.0   ┆ 5.0  ┆ 2   ┆ 0   ┆ 0   ┆ 0   │
│ a   ┆ 6.0   ┆ 7.0  ┆ 0   ┆ 0   ┆ 0   ┆ 0   │
└─────┴───────┴──────┴─────┴─────┴─────┴─────┘
"""

My issue is that this approach "explodes" in RAM and my process gets killed. I suspect the join(..., how="cross") materializes a huge intermediate dataframe, only to discard most of it again in the filter.

Can I get some help/hints on a better way to solve this?

To give some orders of magnitude: my "actions" datasets have on the order of 100-500 time windows (~1 MB), and my "events" datasets have on the order of ~10 million events (~200 MB). The cross join therefore materializes up to 500 × 10 million = 5 billion intermediate rows before filtering, and my process gets killed even with 16 GB of RAM.

EDIT: In my real data, the intervals can overlap. Thanks to @RomanPekar for bringing this up.


Solution

  • The non-equi joins PR has been merged as part of Polars 1.7.0.

    The new method is called .join_where() and, for now, only performs an inner join.

    # actions_df as built in the question; events_df = pl.DataFrame(events)
    (actions_df
      .join_where(events_df,
         pl.col.start <= pl.col.time,
         pl.col.stop >= pl.col.time
      )
      .pivot(
         on = "name",
         index = ["id", "start", "stop"],
         values = "time",
         aggregate_function = pl.len()
      )
    )
    
    shape: (3, 7)
    ┌─────┬───────┬──────┬──────┬──────┬──────┬──────┐
    │ id  ┆ start ┆ stop ┆ x    ┆ y    ┆ z    ┆ w    │
    │ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
    │ str ┆ f64   ┆ f64  ┆ u32  ┆ u32  ┆ u32  ┆ u32  │
    ╞═════╪═══════╪══════╪══════╪══════╪══════╪══════╡
    │ a   ┆ 0.0   ┆ 1.0  ┆ 3    ┆ null ┆ null ┆ null │
    │ a   ┆ 2.0   ┆ 3.0  ┆ null ┆ 1    ┆ 1    ┆ null │
    │ b   ┆ 4.0   ┆ 5.0  ┆ null ┆ null ┆ null ┆ 2    │
    └─────┴───────┴──────┴──────┴──────┴──────┴──────┘
    

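    Because .join_where() is an inner join, windows in which no events occur (such as `a`, 6.0-7.0) drop out of the output above. A minimal sketch of one way to restore them, reusing the left-join-and-fill trick from the question on top of the non-equi join (assumes Polars >= 1.7):

    ```python
    import polars as pl

    # The window list from the question, already in start/stop form.
    actions_df = pl.DataFrame({
        "id": ["a", "a", "b", "a"],
        "start": [0.0, 2.0, 4.0, 6.0],
        "stop": [1.0, 3.0, 5.0, 7.0],
    })
    events_df = pl.DataFrame({
        "name": ["x", "x", "x", "y", "y", "z", "w", "w", "w"],
        "time": [0.0, 0.1, 0.5, 1.1, 2.5, 3.0, 4.5, 4.9, 5.5],
    })

    # Non-equi inner join + pivot: per-window event counts, one column per name.
    counts = (
        actions_df
        .join_where(
            events_df,
            pl.col("start") <= pl.col("time"),
            pl.col("stop") >= pl.col("time"),
        )
        .pivot(on="name", index=["id", "start", "stop"], values="time",
               aggregate_function=pl.len())
    )

    # Left-join the counts back onto the full window list so empty windows
    # reappear, then turn the resulting nulls into zero counts.
    result = (
        actions_df
        .join(counts, on=["id", "start", "stop"], how="left")
        .fill_null(0)
        .sort("start")
    )
    print(result)
    ```

    Unlike the cross join, the non-equi join only produces rows that actually fall inside a window, so memory stays proportional to the number of matches rather than the full cross product.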
    There is an issue for additional join types: