I was writing some code and realized this might be a reasonably common operation. I also realized I don't know a clean way to do it.
The questions is: Get the top 5 entries in column1, sorted by column2, within groups given by column3.
If I had to intuit how this would be writen in Polars it'd be:
df.select(pl.col('column1').top_k(n=5, by='column2').over('column3'))
But note that is made up code; it does not work.
Consider this sample data:
import numpy as np
import pandas as pd
import polars as pl
data_size = 10_000_000
np.random.seed = 1
saleValue = np.random.randint(0, 100, data_size)
storeId = np.random.choice([f'Store: {i}' for i in range(200_000)], replace=True, size=data_size)
customerId = np.random.choice([f'Customer: {i}' for i in range(1_000)], replace=True, size=data_size)
df = pd.DataFrame(
dict(storeId=storeId, customerId=customerId, saleValue=saleValue)
).pipe(pl.from_pandas)
It generates a dataframe of the form:
┌───────────────┬───────────────┬───────────┐
│ storeId ┆ customerId ┆ saleValue │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═══════════════╪═══════════════╪═══════════╡
│ Store: 161472 ┆ Customer: 960 ┆ 29 │
│ Store: 168620 ┆ Customer: 814 ┆ 21 │
│ Store: 37904 ┆ Customer: 80 ┆ 61 │
│ Store: 166077 ┆ Customer: 516 ┆ 23 │
│ Store: 141748 ┆ Customer: 549 ┆ 58 │
└─///───────────┴─///───────────┴─///───────┘
I'm curious how one would get the top 5 customer per store, sorted by their total spend.
One solution is:
(df
# This part is essential; we need to get the total spend (sales)
.group_by('storeId','customerId')
.agg(pl.col('saleValue').sum().alias('totalSales'))
# This is the part I think could be cleaner
.sort('totalSales', descending=True)
.group_by('storeId')
.agg(pl.col('customerId').head(5).alias('customerIds'))
)
┌───────────────┬─────────────────────────────────────┐
│ storeId ┆ customerIds │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═══════════════╪═════════════════════════════════════╡
│ Store: 78152 ┆ ["Customer: 753", "Customer: 170... │
│ Store: 67676 ┆ ["Customer: 957", "Customer: 896... │
│ Store: 45152 ┆ ["Customer: 118", "Customer: 127... │
│ Store: 183339 ┆ ["Customer: 370", "Customer: 227... │
│ Store: 144688 ┆ ["Customer: 328", "Customer: 294... │
└─///───────────┴─///─────────────────────────────────┘
But I wonder if there is something cleaner using .top_k
It can be done a bit cleaner indeed, although you are close.
Define the first step as df_agg
(for the sake of this answer, you can chain):
df_agg = df.group_by('storeId','customerId').agg(pl.col('saleValue').sum().alias('totalSales'))
Then we can do:
df_agg.group_by('storeId').agg(pl.col('customerId').sort_by('totalSales', descending=True).slice(0,5))
Which reads as:
So we do the groupby like you proposed, but the sorting is done inside the aggregation using sort_by
, rather than on the full dataframe. Also, I use slice
rather than head
+ list
.
On your point on top_k
: this function returns the largest elements of itself, not by another. Polars has sufficient ways of achieving what you want, notably sort_by
, that I dont think there is a need to complicate the implementation of top_k
by adding a by
argument.