python python-polars

Polars dataframe join_asof with(keep) null


Update: This issue has been resolved. df.join_asof(df2, on="time", by=["a", "b"]) now runs without error and returns the expected result.


Currently, from my experimentation, join_asof raises an error if there are any None (null) values in either of the "by" columns. Is there any way I can still use join_asof while keeping the None (null) rows from the left dataframe?

For example, I have the following dataframes:

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5, None, 8], "b": [2, 3, 4, 5, 6, 7, None], "time": [1, 2, 3, 4, 5, 6, 7]}
)
df2 = pl.DataFrame({"a": [1, 3, 4, None], "b": [2, 4, 5, 8], "c": [2, 3, 4, 5], "time": [0, 2, 4, 6]})

If I just run the code below, there will be an error:

df.join_asof(df2, on="time", by=["a", "b"])

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("cannot take slice"))', /home/runner/work/polars/polars/polars/polars-core/src/frame/asof_join/groups.rs:253:35

But, the following code works well:

df.drop_nulls(["a", "b"]).join_asof(df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"])

shape: (5, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ time ┆ c    │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ 2   ┆ 3   ┆ 2    ┆ null │
│ 3   ┆ 4   ┆ 3    ┆ 3    │
│ 4   ┆ 5   ┆ 4    ┆ 4    │
│ 5   ┆ 6   ┆ 5    ┆ null │
└─────┴─────┴──────┴──────┘

My question is: how can I get the following result, which is essentially the result above with the rows from the left dataframe (df, in this case) where a or b is null appended?

┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ time ┆ c    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ 1    ┆ 2    │
│ 2    ┆ 3    ┆ 2    ┆ null │
│ 3    ┆ 4    ┆ 3    ┆ 3    │
│ 4    ┆ 5    ┆ 4    ┆ 4    │
│ 5    ┆ 6    ┆ 5    ┆ null │
│ null ┆ 7    ┆ 6    ┆ null │
│ 8    ┆ null ┆ 7    ┆ null │
└──────┴──────┴──────┴──────┘

Thanks!


Solution

  • One easy solution is to use concat with how='diagonal'. For example:

    pl.concat(
        [
            df.drop_nulls(["a", "b"]).join_asof(df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"]),
            df.filter(pl.col('a').is_null() | pl.col('b').is_null()),
        ],
        how='diagonal'
    )
    
    shape: (7, 4)
    ┌──────┬──────┬──────┬──────┐
    │ a    ┆ b    ┆ time ┆ c    │
    │ ---  ┆ ---  ┆ ---  ┆ ---  │
    │ i64  ┆ i64  ┆ i64  ┆ i64  │
    ╞══════╪══════╪══════╪══════╡
    │ 1    ┆ 2    ┆ 1    ┆ 2    │
    │ 2    ┆ 3    ┆ 2    ┆ null │
    │ 3    ┆ 4    ┆ 3    ┆ 3    │
    │ 4    ┆ 5    ┆ 4    ┆ 4    │
    │ 5    ┆ 6    ┆ 5    ┆ null │
    │ null ┆ 7    ┆ 6    ┆ null │
    │ 8    ┆ null ┆ 7    ┆ null │
    └──────┴──────┴──────┴──────┘
    

Edit:

    A comment asked: "diagonal pl.concat seems to be pretty slow if the dataframes are large?"

    Is it? Let's benchmark:

    import time
    
    import polars as pl
    
    mult = 100_000_000
    df = pl.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, None, 8] * mult,
            "b": [2, 3, 4, 5, 6, 7, None] * mult,
            "time": [1, 2, 3, 4, 5, 6, 7] * mult,
        }
    ).sort("time")
    df2 = pl.DataFrame(
        {
            "a": [1, 3, 4, None] * mult,
            "b": [2, 4, 5, 8] * mult,
            "c": [2, 3, 4, 5] * mult,
            "time": [0, 2, 4, 6] * mult,
        }
    ).sort("time")
    
    not_null_df = df.drop_nulls(["a", "b"]).join_asof(
        df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"]
    )
    is_null_df = df.filter(pl.col("a").is_null() | pl.col("b").is_null())
    
    not_null_df
    is_null_df
    
    start = time.perf_counter()
    pl.concat([not_null_df, is_null_df], how="diagonal")
    print(time.perf_counter() - start)
    
    >>> not_null_df
    shape: (500000000, 4)
    ┌─────┬─────┬──────┬──────┐
    │ a   ┆ b   ┆ time ┆ c    │
    │ --- ┆ --- ┆ ---  ┆ ---  │
    │ i64 ┆ i64 ┆ i64  ┆ i64  │
    ╞═════╪═════╪══════╪══════╡
    │ 1   ┆ 2   ┆ 1    ┆ 2    │
    │ 1   ┆ 2   ┆ 1    ┆ 2    │
    │ 1   ┆ 2   ┆ 1    ┆ 2    │
    │ 1   ┆ 2   ┆ 1    ┆ 2    │
    │ ... ┆ ... ┆ ...  ┆ ...  │
    │ 5   ┆ 6   ┆ 5    ┆ null │
    │ 5   ┆ 6   ┆ 5    ┆ null │
    │ 5   ┆ 6   ┆ 5    ┆ null │
    │ 5   ┆ 6   ┆ 5    ┆ null │
    └─────┴─────┴──────┴──────┘
    >>> is_null_df
    shape: (200000000, 3)
    ┌──────┬──────┬──────┐
    │ a    ┆ b    ┆ time │
    │ ---  ┆ ---  ┆ ---  │
    │ i64  ┆ i64  ┆ i64  │
    ╞══════╪══════╪══════╡
    │ null ┆ 7    ┆ 6    │
    │ null ┆ 7    ┆ 6    │
    │ null ┆ 7    ┆ 6    │
    │ null ┆ 7    ┆ 6    │
    │ ...  ┆ ...  ┆ ...  │
    │ 8    ┆ null ┆ 7    │
    │ 8    ┆ null ┆ 7    │
    │ 8    ┆ null ┆ 7    │
    │ 8    ┆ null ┆ 7    │
    └──────┴──────┴──────┘
    
    >>> pl.concat([not_null_df, is_null_df], how="diagonal")
    shape: (700000000, 4)
    ┌─────┬──────┬──────┬──────┐
    │ a   ┆ b    ┆ time ┆ c    │
    │ --- ┆ ---  ┆ ---  ┆ ---  │
    │ i64 ┆ i64  ┆ i64  ┆ i64  │
    ╞═════╪══════╪══════╪══════╡
    │ 1   ┆ 2    ┆ 1    ┆ 2    │
    │ 1   ┆ 2    ┆ 1    ┆ 2    │
    │ 1   ┆ 2    ┆ 1    ┆ 2    │
    │ 1   ┆ 2    ┆ 1    ┆ 2    │
    │ ... ┆ ...  ┆ ...  ┆ ...  │
    │ 8   ┆ null ┆ 7    ┆ null │
    │ 8   ┆ null ┆ 7    ┆ null │
    │ 8   ┆ null ┆ 7    ┆ null │
    │ 8   ┆ null ┆ 7    ┆ null │
    └─────┴──────┴──────┴──────┘
    >>> print(time.perf_counter() - start)
    6.087414998997701
    

    So, about 6 seconds to concatenate datasets of 500,000,000 and 200,000,000 records.