import polars as pl
...
# Sort by date, then pick the first row for each UID (earliest date)
sample_frame = sample_frame.sort(by=DATE_COL).unique(subset=UID_COL, keep='first')
I expected the frame resulting from the above operation to be sorted by date, but that does not seem to be the case.
So does the deduplication operation also change the order of the remaining rows? Do the Polars documentation or its maintainers provide any guarantee on row ordering after calling unique?
Polars won't maintain order by default, since that requires more computation. If you need it to, you can use the maintain_order parameter of the unique() method:
maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Settings this to True blocks the possibility to run on the streaming engine.
sample_frame = (
    sample_frame
    .sort(by=DATE_COL)
    .unique(subset=UID_COL, keep='first', maintain_order=True)
)