What is the best way to find the differences between 2 Polars dataframes?
The equals
method tells me if there is a difference, I want to find where is the difference.
Example:
import polars as pl
df1 = pl.DataFrame([
{'id': 1,'col1': ['a',None],'col2': ['x']},
{'id': 2,'col1': ['b'],'col2': ['y', None]},
{'id': 3,'col1': [None],'col2': ['z']}]
)
┌─────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ ["a", null] ┆ ["x"] │
│ 2 ┆ ["b"] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┘
df2 = pl.DataFrame([
{'id': 1,'col1': ['a'],'col2': ['x']},
{'id': 2,'col1': ['b', None],'col2': ['y', None]},
{'id': 3,'col1': [None],'col2': ['z']}]
)
┌─────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ ["a"] ┆ ["x"] │
│ 2 ┆ ["b", null] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┘
The difference in the example is for id = 1 and id = 2.
I can join the dataframes:
df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 ┆ col1_df2 ┆ col2_df2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ ["a", null] ┆ ["x"] ┆ ["a"] ┆ ["x"] │
│ 2 ┆ ["b"] ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘
Expected result
I would like to either:
The example has only 2 columns, but there are more columns in the dataframe.
Here's the filter approach
(
df1.join(df2, on='id', suffix='_df2')
.filter(pl.any_horizontal(
pl.col(x).ne_missing(pl.col(f"{x}_df2"))
for x in df1.columns if x!='id'
))
)
If you wanted the bool column then you just change the filter
to with_columns
and add an alias.
(
df1.join(df2, on='id', suffix='_df2')
.with_columns(
has_diff = pl.any_horizontal(
pl.col(x).ne_missing(pl.col(f"{x}_df2"))
for x in df1.columns if x!='id'
)
)
)
This assumes that each df has all the same columns other than 'id'.