Let's say I have a DataFrame:
df = pl.from_repr("""
┌────────────────┬──────┬───────┐
│ transaction_id ┆ user ┆ book │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞════════════════╪══════╪═══════╡
│ 1 ┆ bob ┆ bookA │
│ 2 ┆ bob ┆ bookA │
│ 3 ┆ bob ┆ bookB │
│ 4 ┆ tim ┆ bookA │
│ 5 ┆ lucy ┆ bookA │
│ 6 ┆ lucy ┆ bookC │
│ 7 ┆ lucy ┆ bookC │
│ 8 ┆ lucy ┆ bookC │
└────────────────┴──────┴───────┘
""")
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
│ tim ┆ bookA │
│ lucy ┆ bookC │
└──────┴──────────┘
Now I've worked out how to do it like so:
df2 = df.group_by("user", "book").agg(
    pl.col("book").count().alias("book_count"),
    pl.col("transaction_id"),  # just so we can double check where it all came from - TODO: how to output this to csv?
)
print(df2)
df3 = df2.sort("user", "book_count", descending=True).group_by("user").agg(
    pl.col("book").first().alias("fav_book")
)
print(df3)
but really the normal SQL way of doing it is a dense_rank over the book count sorted descending, keeping rows where rank = 1. I have tried for hours to get this to work, but I can't find a relevant example in the docs.
The issue is that in the docs, none of the agg examples reference the output of another agg: in this case it needs to reference the count of each book per user, then sort those counts descending and rank based on that sort order. Please provide an example that shows how to use rank to perform this task, and also how to nest aggregations efficiently.
We could first group_by "user" and "book" to get all user -> book combinations and count how often each occurs. This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
│ lucy ┆ bookA ┆ 1 │
│ bob ┆ bookB ┆ 1 │
│ tim ┆ bookA ┆ 1 │
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another group_by "user" where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    "book": ["bookA", "bookA", "bookB", "bookA", "bookA", "bookC", "bookC", "bookC"],
    "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "user": ["bob", "bob", "bob", "tim", "lucy", "lucy", "lucy", "lucy"],
})
(df.group_by("user", "book")
   .agg(
       pl.col("book").count().alias("book_count")
   )
   .group_by("user")
   .agg(
       pl.col("book").gather(pl.col("book_count").arg_max()).alias("fav_book")
   )
)
And creates this output:
shape: (3, 2)
┌──────┬───────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════╪═══════════╡
│ bob ┆ ["bookA"] │
│ lucy ┆ ["bookC"] │
│ tim ┆ ["bookA"] │
└──────┴───────────┘
Another approach would be creating a book_count column with a window expression and then using the index of the maximum to take the correct book in aggregation:
(df
 .with_columns(pl.col("book").count().over("user", "book").alias("book_count"))
 .group_by("user")
 .agg(
     pl.col("book").gather(pl.col("book_count").arg_max())
 )
)