Search code examples
pythonpython-3.xdataframesortingpython-polars

Issues when sorting a polars DataFrame based on a Categorical column


I am quite new polars, and while messing around with grouping and sorting operations, I found that sorting a polars DataFrame based on a created Categorical column results in weird behaviors.

Specifically, given a polars DataFrame df containing a categorical column decile that is created via the .qcut() method with values ('q1', 'q2', ..., 'q10'), directly applying the following: df.sort('decile') correctly sorts the data.

However, if we were to group the DataFrame based on the decile column first, and then sort it based on decile, the resulting DataFrame doesn't sort properly!

The weird thing is that if you were to sort the DataFrame based on another column before creating the decile column, sorting the decile column now works even after grouping by it; though the sorting order is reversed (i.e., ascending sorts descending and vice-versa, which is weird).

Here's a minimal reproducible example to illustrate the case:

# Create a toy DataFrame
size = 1e5
df = pl.DataFrame({
    "id": np.random.randint(50000, size=int(size), dtype=np.uint16),
    "amount": np.round(np.random.uniform(10, 100000, int(size)).astype(np.float32), 2),
    "quantity": np.random.randint(1, 7, size=int(size), dtype=np.uint16)
})

Illustrating the first case:

df = (df.group_by("id")
      .agg(revenue=pl.sum("amount"), tot_quantity=pl.sum("quantity"))
     )
df = df.with_columns(
    (df["revenue"].qcut(10, labels=[f'q{i}' for i in range(10, 0, -1)])).alias("decile")
)
df = df.group_by("decile").agg(pl.col("revenue").sum(), pl.col("tot_quantity").sum())
df = df.sort('decile')  # This doesn't work properly!

However, sorting based on revenue first, fixes the problem, but the sorting order is reversed:

df = (df.group_by("id")
      .agg(revenue=pl.sum("amount"), tot_quantity=pl.sum("quantity"))
     ).sort("revenue")  # Sort by "revenue" prior to creating "decile"
df = df.with_columns(
    (df["revenue"].qcut(10, labels=[f'q{i}' for i in range(10, 0, -1)])).alias("decile")
)
df = df.group_by("decile").agg(pl.col("revenue").sum(), pl.col("tot_quantity").sum())
df = df.sort('decile')  # It works now but order is reversed!

Anyone has any idea what's going on? I've been trying to figure out why this is happening but to no avail, I tried casting to a different datatype but still did't work.

Appreciate any help!


Solution

  • By default, the sorted order of pl.Categorical columns is done by its underlying physical representation.

    You can cast to a lexically ordered Categorical instead.

    (df["revenue"].qcut(10, labels=[f'q{i:02d}' for i in range(10, 0, -1)])).cast(pl.Categorical('lexical')).alias("decile")
    
    shape: (10, 3)
    ┌────────┬──────────────┬──────────────┐
    │ decile ┆ revenue      ┆ tot_quantity │
    │ ---    ┆ ---          ┆ ---          │
    │ cat    ┆ f32          ┆ i64          │
    ╞════════╪══════════════╪══════════════╡
    │ q01    ┆ 1.1978e9     ┆ 68797        │
    │ q02    ┆ 8.43843712e8 ┆ 51888        │
    │ q03    ┆ 6.91437504e8 ┆ 43693        │
    │ q04    ┆ 5.75548224e8 ┆ 39623        │
    │ q05    ┆ 4.72274336e8 ┆ 35500        │
    │ q06    ┆ 3.960448e8   ┆ 26126        │
    │ q07    ┆ 3.3030432e8  ┆ 24030        │
    │ q08    ┆ 2.56758752e8 ┆ 22728        │
    │ q09    ┆ 1.69156608e8 ┆ 20269        │
    │ q10    ┆ 6.3238428e7  ┆ 17518        │
    └────────┴──────────────┴──────────────┘
    

    (Note how I formatted i with 02d to get a padded 0 in, so q10 is sorted last.)