Search code examples
dataframejoinmergepython-polarsouter-join

Polars Dataframe full-join (outer) on multiple columns without suffix


I have this code:

import polars as pl

df1 = pl.DataFrame({
    'type':   ['A', 'O', 'B', 'O'],
    'origin': ['EU', 'US', 'US', 'EU'],
    'qty1':   [343,11,22,-5]
})

df2 = pl.DataFrame({
    'type':   ['A', 'O', 'B', 'S'],
    'origin': ['EU', 'US', 'US', 'AS'],
    'qty2':   [-200,-12,-25,8]
})

df1.join(df2, on=['type', 'origin'], how='full')

which gives

┌──────┬────────┬──────┬────────────┬──────────────┬──────┐
│ type ┆ origin ┆ qty1 ┆ type_right ┆ origin_right ┆ qty2 │
│ ---  ┆ ---    ┆ ---  ┆ ---        ┆ ---          ┆ ---  │
│ str  ┆ str    ┆ i64  ┆ str        ┆ str          ┆ i64  │
╞══════╪════════╪══════╪════════════╪══════════════╪══════╡
│ A    ┆ EU     ┆ 343  ┆ A          ┆ EU           ┆ -200 │
│ O    ┆ US     ┆ 11   ┆ O          ┆ US           ┆ -12  │
│ B    ┆ US     ┆ 22   ┆ B          ┆ US           ┆ -25  │
│ null ┆ null   ┆ null ┆ S          ┆ AS           ┆ 8    │
│ O    ┆ EU     ┆ -5   ┆ null       ┆ null         ┆ null │
└──────┴────────┴──────┴────────────┴──────────────┴──────┘

But the output I am after is this:

┌──────┬────────┬──────┬──────┐
│ type ┆ origin ┆ qty1 ┆ qty2 │
│ ---  ┆ ---    ┆ ---  ┆ ---  │
│ str  ┆ str    ┆ i64  ┆ i64  │
╞══════╪════════╪══════╪══════╡
│ A    ┆ EU     ┆ 343  ┆ -200 │
│ O    ┆ US     ┆ 11   ┆ -12  │
│ B    ┆ US     ┆ 22   ┆ -25  │
│ S    ┆ AS     ┆ null ┆ 8    │
│ O    ┆ EU     ┆ -5   ┆ null │
└──────┴────────┴──────┴──────┘

I tried suffix='' via df1.join(df2, on=['type', 'origin'], how='full', suffix=''), but this raises an error:

DuplicateError: unable to hstack, column with name "type" already exists

How can I achieve this?


Solution

  • You are looking for the coalesce parameter. Setting it to True gives the desired result.

    df1.join(df2, on=['type', 'origin'], how='full', coalesce=True)
    
    shape: (5, 4)
    ┌──────┬────────┬──────┬──────┐
    │ type ┆ origin ┆ qty1 ┆ qty2 │
    │ ---  ┆ ---    ┆ ---  ┆ ---  │
    │ str  ┆ str    ┆ i64  ┆ i64  │
    ╞══════╪════════╪══════╪══════╡
    │ A    ┆ EU     ┆ 343  ┆ -200 │
    │ O    ┆ US     ┆ 11   ┆ -12  │
    │ B    ┆ US     ┆ 22   ┆ -25  │
    │ S    ┆ AS     ┆ null ┆ 8    │
    │ O    ┆ EU     ┆ -5   ┆ null │
    └──────┴────────┴──────┴──────┘
    

    From the documentation of pl.DataFrame.join.

    coalesce

    Coalescing behavior (merging of join columns).

    • None: -> join specific.

    • True: -> Always coalesce join columns.

    • False: -> Never coalesce join columns.

    Note that joining on any other expressions than col will turn off coalescing.