Search code examples
python-polars

How to select the top_k of one column, sorted by another column, within a third column in Polars?


I was writing some code and realized this might be a reasonably common operation. I also realized I don't know a clean way to do it.

The questions is: Get the top 5 entries in column1, sorted by column2, within groups given by column3.

If I had to intuit how this would be writen in Polars it'd be:

df.select(pl.col('column1').top_k(n=5, by='column2').over('column3'))

But note that is made up code; it does not work.

Consider this sample data:

import numpy as np 
import pandas as pd 
import polars as pl 


data_size = 10_000_000

np.random.seed = 1
saleValue    = np.random.randint(0, 100, data_size)
storeId      = np.random.choice([f'Store: {i}' for i in range(200_000)], replace=True, size=data_size)
customerId   = np.random.choice([f'Customer: {i}' for i in range(1_000)], replace=True, size=data_size)

df = pd.DataFrame(
    dict(storeId=storeId, customerId=customerId, saleValue=saleValue)
).pipe(pl.from_pandas)

It generates a dataframe of the form:

┌───────────────┬───────────────┬───────────┐
│ storeId       ┆ customerId    ┆ saleValue │
│ ---           ┆ ---           ┆ ---       │
│ str           ┆ str           ┆ i32       │
╞═══════════════╪═══════════════╪═══════════╡
│ Store: 161472 ┆ Customer: 960 ┆ 29        │
│ Store: 168620 ┆ Customer: 814 ┆ 21        │
│ Store: 37904  ┆ Customer: 80  ┆ 61        │
│ Store: 166077 ┆ Customer: 516 ┆ 23        │
│ Store: 141748 ┆ Customer: 549 ┆ 58        │
└─///───────────┴─///───────────┴─///───────┘

I'm curious how one would get the top 5 customer per store, sorted by their total spend.

One solution is:

(df

  # This part is essential; we need to get the total spend (sales)
 .group_by('storeId','customerId')
 .agg(pl.col('saleValue').sum().alias('totalSales'))

  # This is the part I think could be cleaner
 .sort('totalSales', descending=True)
 .group_by('storeId')
 .agg(pl.col('customerId').head(5).alias('customerIds'))
)
┌───────────────┬─────────────────────────────────────┐
│ storeId       ┆ customerIds                         │
│ ---           ┆ ---                                 │
│ str           ┆ list[str]                           │
╞═══════════════╪═════════════════════════════════════╡
│ Store: 78152  ┆ ["Customer: 753", "Customer: 170... │
│ Store: 67676  ┆ ["Customer: 957", "Customer: 896... │
│ Store: 45152  ┆ ["Customer: 118", "Customer: 127... │
│ Store: 183339 ┆ ["Customer: 370", "Customer: 227... │
│ Store: 144688 ┆ ["Customer: 328", "Customer: 294... │
└─///───────────┴─///─────────────────────────────────┘

But I wonder if there is something cleaner using .top_k


Solution

  • It can be done a bit cleaner indeed, although you are close.

    Define the first step as df_agg (for the sake of this answer, you can chain):

    df_agg = df.group_by('storeId','customerId').agg(pl.col('saleValue').sum().alias('totalSales'))
    

    Then we can do:

    df_agg.group_by('storeId').agg(pl.col('customerId').sort_by('totalSales', descending=True).slice(0,5))
    

    Which reads as:

    • group by store
    • take column customer id
    • we sort the values of this column by the column totalSales, from high to low (descending=True)
    • we take the first 5 values of customer id

    So we do the groupby like you proposed, but the sorting is done inside the aggregation using sort_by, rather than on the full dataframe. Also, I use slice rather than head + list.

    On your point on top_k: this function returns the largest elements of itself, not by another. Polars has sufficient ways of achieving what you want, notably sort_by, that I dont think there is a need to complicate the implementation of top_k by adding a by argument.