I have 3 polars dataframes, one that contains 2 IDS, and the other ones contains an ID and a value. I would like to join the 3 dataframes if the ID of the main table exists on one of the other tables and bring a values from a desired column.
My current aproach its just rename each Table ID and then do a .join(how = 'left')
, however, i think renaming and duplicate tables its not the correct way to approach this problem. (Due the extra code, and the wasted ram)
The first one contains 2 ID columns:
data = {
"ID1" : [1,2,3],
"ID2" : [1,4,5]
}
df = pl.DataFrame(data)
The second and third are dataframes than contains an ID and a value:
T1 = {
"ID" : [9,2,5],
"Values" : ["A","B","c"],
"Values II" : ["foo","boo","baz"]
}
T1 = pl.DataFrame(T1)
T2 = {
"ID" : [1,4,10],
"Values" : ["X","J","c"]
}
T2 = pl.DataFrame(T2)
I can check if the ID
exists on the other tables like this
(
df
.with_columns(
ID1_is_on_T1 = pl.col("ID1").is_in(T1.select(pl.col("ID"))),
ID2_is_on_T1 = pl.col("ID2").is_in(T1.select(pl.col("ID"))),
ID1_is_on_T2 = pl.col("ID1").is_in(T2.select(pl.col("ID"))),
ID2_is_on_T2 = pl.col("ID2").is_in(T2.select(pl.col("ID"))),
)
)
And i'm looking to do somehting like this:
(
df
.with_columns(
pl
.when(
pl.col("ID1").is_in(T1.select(pl.col("ID")))
)
.then(
T1.select(pl.col("Values"))
)
.otherwise(0)
)
)
ValueError: can only call .item() if the dataframe is of shape (1, 1), or if explicit row/col values are provided; frame has shape (3, 1)
Current .join()
approach:
T1_1 = (
T1
.rename(
{"ID": "ID1"}
)
)
T1_2 = (
T1
.rename(
{"ID": "ID2"}
)
)
Join_1 = df.join(T1_1,on = "ID1", how="left").rename({"Values" : "ID1_Values", "Values II" : "ID1_Values II"})
Join_2 = Join_1.join(T1_2, on = "ID2", how="left").rename({"Values" : "ID2_Values", "Values II" : "ID2_Values II"})
On this approach its only considering the first table, i would need to do the same for the T2 too.
Two joins will be most efficient here. You can avoid excessive renaming by specifing left_on
and right_on
separately, as well as using a suffix
to clarify the duplicate names.
>>> (df.join(T1, left_on="ID1", right_on="ID", how="left")
... .join(T2, left_on="ID2", right_on="ID", how="left", suffix="_T2"))
shape: (3, 5)
┌─────┬─────┬────────┬───────────┬───────────┐
│ ID1 ┆ ID2 ┆ Values ┆ Values II ┆ Values_T2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ str │
╞═════╪═════╪════════╪═══════════╪═══════════╡
│ 1 ┆ 1 ┆ null ┆ null ┆ X │
│ 2 ┆ 4 ┆ B ┆ boo ┆ J │
│ 3 ┆ 5 ┆ null ┆ null ┆ null │
└─────┴─────┴────────┴───────────┴───────────┘