python sorting duplicates python-polars

Is Polars Guaranteed to Maintain Order After Deduplicating Over a Column?


The Code
import polars as pl
...
# Sort by date, then pick the first row for each UID (earliest date)
sample_frame = sample_frame.sort(by=DATE_COL).unique(subset=UID_COL, keep='first')

Question

I expected the frame resulting from the above operation to be sorted by date, but that does not seem to be the case.

So does the deduplication also disturb the order of the remaining rows? Does the Polars documentation, or do its maintainers, provide any guarantee about row ordering after calling unique?


Solution

  • By default, unique() does not guarantee row order, since preserving it requires more computation. If you need the original order kept, use the maintain_order parameter of the unique() method:

    maintain_order

    Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True blocks the possibility to run on the streaming engine.

    sample_frame = (
        sample_frame
        .sort(by=DATE_COL)
        .unique(subset=UID_COL, keep='first', maintain_order=True)
    )