Search code examples
pythonpython-polars

How to mask a polars dataframe using another dataframe?


I have a polars dataframe like so:

df = pl.from_repr("""
┌─────────────────────┬─────────┬─────────┐
│ time                ┆ 1       ┆ 2       │
│ ---                 ┆ ---     ┆ ---     │
│ datetime[μs]        ┆ f64     ┆ f64     │
╞═════════════════════╪═════════╪═════════╡
│ 2021-10-02 00:05:00 ┆ 2.9048  ┆ 2.8849  │
│ 2021-10-02 00:10:00 ┆ 48224.0 ┆ 48068.0 │
└─────────────────────┴─────────┴─────────┘
""")

and a masking dataframe with similar columns and time value like so:

df_mask = pl.from_repr("""
┌─────────────────────┬───────┬───────┐
│ time                ┆ 1     ┆ 2     │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ bool  ┆ bool  │
╞═════════════════════╪═══════╪═══════╡
│ 2021-10-02 00:05:00 ┆ false ┆ false │
│ 2021-10-02 00:10:00 ┆ true  ┆ true  │
└─────────────────────┴───────┴───────┘
""")

I am looking for this result:

shape: (2, 3)
┌─────────────────────┬────────┬─────────┐
│ time                ┆ 1      ┆ 2       │
│ ---                 ┆ ---    ┆ ---     │
│ datetime[μs]        ┆ f64    ┆ f64     │
╞═════════════════════╪════════╪═════════╡
│ 2021-10-02 00:05:00 ┆ null   ┆ null    │
│ 2021-10-02 00:10:00 ┆ 2.8849 ┆ 48068.0 │
└─────────────────────┴────────┴─────────┘

Here I only show with 2 columns '1' and '2' but there could any number of them.

Any help would be appreciated!


Solution

  • Having columns in a single DataFrame has guarantees you don't have when you have data in separate tables.

    Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations and therefore you must first create a single DataFrame from the two, and then select the columns/computations you need.

    So let's do that.

    The first thing you can do is join the two tables. This guaranteed to work on DataFrames of different sizes.

    df_a.join(df_mask, on="time", suffix="_mask")
    

    This comes however with a price, as joining is not free.

    If you are 100% certain your dataframes have the same height, you can use a horizontal concat.

    (
        pl.concat(
            [df_a, df_mask.select(pl.all().name.suffix("_mask"))], how="horizontal"
        ).select(
            [pl.col("time")]
            + [
                pl.when(pl.col(f"{name}_mask")).then(pl.col(name)).otherwise(None)
                for name in ["1", "2"]
            ]
        )
    )
    

    In the final select query we take the columns we want. And compute the masked values with a when -> then -> otherwise branch.

    This outputs:

    shape: (2, 3)
    ┌─────────────────────┬─────────┬─────────┐
    │ time                ┆ 1       ┆ 2       │
    │ ---                 ┆ ---     ┆ ---     │
    │ datetime[μs]        ┆ f64     ┆ f64     │
    ╞═════════════════════╪═════════╪═════════╡
    │ 2021-10-02 00:05:00 ┆ null    ┆ null    │
    │ 2021-10-02 00:10:00 ┆ 48224.0 ┆ 48068.0 │
    └─────────────────────┴─────────┴─────────┘