
Does polars preserve row order in a left join?


Consider the following polars dataframes:

>>> import polars as pl
>>> left = pl.DataFrame(pl.Series('a', [1,5,3,2]))
>>> left
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 5   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 2   │
└─────┘
>>> right = pl.DataFrame([pl.Series('a', [0,1,2,3]), pl.Series('b', [4,5,6,7])])
>>> right
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 7   │
└─────┴─────┘

I would like to join the two in such a way that the order of the a values from the left dataframe is preserved. A left join seems to do this:

>>> left.join(right, on='a', how='left')
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 5    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ 7    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 6    │
└─────┴──────┘

My question is: is this behaviour guaranteed? If not, what would be the safe way to do this? I could use with_row_count and then do a final sort, but that seems rather cumbersome. In pandas this can be done concisely with the reindex method.


Solution

  • A left join preserves the row order of the left dataframe, at least in the regular (in-memory) engine. The streaming engine may not give the same guarantee.

    If you want to be safe, the workaround you already have in mind is the right one: add a row count column before the join and sort on it afterwards.