Search code examples
pythonpython-polars

Use polars .when() instead joins


I have 3 polars dataframes, one that contains 2 IDS, and the other ones contains an ID and a value. I would like to join the 3 dataframes if the ID of the main table exists on one of the other tables and bring a values from a desired column.

My current aproach its just rename each Table ID and then do a .join(how = 'left'), however, i think renaming and duplicate tables its not the correct way to approach this problem. (Due the extra code, and the wasted ram)

The first one contains 2 ID columns:

data = {
    "ID1" : [1,2,3],
    "ID2" : [1,4,5]
}
df = pl.DataFrame(data)

The second and third are dataframes than contains an ID and a value:

T1 = {
    "ID" : [9,2,5],
    "Values" : ["A","B","c"],
    "Values II" : ["foo","boo","baz"]
}
T1 = pl.DataFrame(T1)

T2 = {
    "ID" : [1,4,10],
    "Values" : ["X","J","c"]
}
T2 = pl.DataFrame(T2)

I can check if the ID exists on the other tables like this

(
    df
    .with_columns(
        ID1_is_on_T1 = pl.col("ID1").is_in(T1.select(pl.col("ID"))),
        ID2_is_on_T1 = pl.col("ID2").is_in(T1.select(pl.col("ID"))),
        ID1_is_on_T2 = pl.col("ID1").is_in(T2.select(pl.col("ID"))),
        ID2_is_on_T2 = pl.col("ID2").is_in(T2.select(pl.col("ID"))),
    )
)

And i'm looking to do somehting like this:

(
    df
    .with_columns(
        pl
        .when(
            pl.col("ID1").is_in(T1.select(pl.col("ID")))
        )
        .then(
            T1.select(pl.col("Values"))
        )
        .otherwise(0)
    )
)

ValueError: can only call .item() if the dataframe is of shape (1, 1), or if explicit row/col values are provided; frame has shape (3, 1)

Current .join() approach:

T1_1 = (
    T1
    .rename(
        {"ID": "ID1"}
    )
)

T1_2 = (
    T1
    .rename(
        {"ID": "ID2"}
    )
)

Join_1 = df.join(T1_1,on = "ID1", how="left").rename({"Values" : "ID1_Values", "Values II" : "ID1_Values II"})
Join_2 = Join_1.join(T1_2, on = "ID2", how="left").rename({"Values" : "ID2_Values", "Values II" : "ID2_Values II"})

On this approach its only considering the first table, i would need to do the same for the T2 too.


Solution

  • Two joins will be most efficient here. You can avoid excessive renaming by specifing left_on and right_on separately, as well as using a suffix to clarify the duplicate names.

    >>> (df.join(T1, left_on="ID1", right_on="ID", how="left")
    ...    .join(T2, left_on="ID2", right_on="ID", how="left", suffix="_T2"))
    shape: (3, 5)
    ┌─────┬─────┬────────┬───────────┬───────────┐
    │ ID1 ┆ ID2 ┆ Values ┆ Values II ┆ Values_T2 │
    │ --- ┆ --- ┆ ---    ┆ ---       ┆ ---       │
    │ i64 ┆ i64 ┆ str    ┆ str       ┆ str       │
    ╞═════╪═════╪════════╪═══════════╪═══════════╡
    │ 1   ┆ 1   ┆ null   ┆ null      ┆ X         │
    │ 2   ┆ 4   ┆ B      ┆ boo       ┆ J         │
    │ 3   ┆ 5   ┆ null   ┆ null      ┆ null      │
    └─────┴─────┴────────┴───────────┴───────────┘