python-polars

In Polars, how can I use rank() to get the most popular category per user?


Let's say I have a DataFrame:

import polars as pl

df = pl.from_repr("""
┌────────────────┬──────┬───────┐
│ transaction_id ┆ user ┆ book  │
│ ---            ┆ ---  ┆ ---   │
│ i64            ┆ str  ┆ str   │
╞════════════════╪══════╪═══════╡
│ 1              ┆ bob  ┆ bookA │
│ 2              ┆ bob  ┆ bookA │
│ 3              ┆ bob  ┆ bookB │
│ 4              ┆ tim  ┆ bookA │
│ 5              ┆ lucy ┆ bookA │
│ 6              ┆ lucy ┆ bookC │
│ 7              ┆ lucy ┆ bookC │
│ 8              ┆ lucy ┆ bookC │
└────────────────┴──────┴───────┘
""")

Per user, I want to find the book they have shown the most preference towards. For example, the output should be:

shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ ---  ┆ ---      │
│ str  ┆ str      │
╞══════╪══════════╡
│ bob  ┆ bookA    │
│ tim  ┆ bookA    │
│ lucy ┆ bookC    │
└──────┴──────────┘

Now I've worked out how to do it like so:

df2 = df.group_by("user", "book").agg(
  pl.col("book").count().alias("book_count"),
  pl.col("transaction_id") # just so we can double check where it all came from - TODO: how to output this to csv?
)

print(df2)

df3 = df2.sort("user", "book_count", descending=True).group_by("user").agg(
  pl.col("book").first().alias("fav_book")
)

print(df3)

But really, the normal SQL way of doing this is a dense_rank() over the book count descending, filtered to rank = 1. I have tried for hours to get this to work, but I can't find a relevant example in the docs.

The issue is that none of the agg examples in the docs reference the output of another aggregation. In this case it needs to reference the count of each book per user, then sort those counts descending, and then rank based on that sort order.

Please provide an example that shows how to use rank() to perform this task, and also how to nest aggregations efficiently.


Solution

  • Approach 1

    We could first group_by "user" and "book" to get all user -> book combinations and count how often each occurs.

    This would give the following intermediate DataFrame:

    shape: (5, 3)
    ┌──────┬───────┬────────────┐
    │ user ┆ book  ┆ book_count │
    │ ---  ┆ ---   ┆ ---        │
    │ str  ┆ str   ┆ u32        │
    ╞══════╪═══════╪════════════╡
    │ lucy ┆ bookC ┆ 3          │
    │ lucy ┆ bookA ┆ 1          │
    │ bob  ┆ bookB ┆ 1          │
    │ tim  ┆ bookA ┆ 1          │
    │ bob  ┆ bookA ┆ 2          │
    └──────┴───────┴────────────┘
    
    

    Then we can do another group_by on "user", where we compute the index of the maximum book_count and use that index to take the correct book.

    The whole query looks like this:

    df = pl.DataFrame({
        "book": ["bookA", "bookA", "bookB", "bookA", "bookA",
                 "bookC", "bookC", "bookC"],
        "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8],
        "user": ["bob", "bob", "bob", "tim", "lucy", "lucy", "lucy", "lucy"],
    })
    
    (df.group_by("user", "book")
     .agg(
         # number of transactions per (user, book) combination
         pl.col("book").count().alias("book_count")
     )
     .group_by("user")
     .agg(
         # arg_max gives the index of the highest count; gather takes that book
         pl.col("book").gather(pl.col("book_count").arg_max()).alias("fav_book")
     )
    )
    

    This creates the following output:

    shape: (3, 2)
    ┌──────┬───────────┐
    │ user ┆ fav_book  │
    │ ---  ┆ ---       │
    │ str  ┆ list[str] │
    ╞══════╪═══════════╡
    │ bob  ┆ ["bookA"] │
    │ lucy ┆ ["bookC"] │
    │ tim  ┆ ["bookA"] │
    └──────┴───────────┘
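
    Note that fav_book comes back as list[str] rather than the plain str in the expected output, because gather inside agg produces a list per group. One way to unwrap it (a small tweak of the same query; ties are broken arbitrarily here) is to take the single element with first():

    (df.group_by("user", "book")
     .agg(pl.col("book").count().alias("book_count"))
     .group_by("user")
     .agg(
         # the gathered result has exactly one element, so first() unwraps it to str
         pl.col("book").gather(pl.col("book_count").arg_max()).first().alias("fav_book")
     )
    )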
    

  • Approach 2

    Another approach would be to create a book_count column with a window expression and then use the index of the maximum to take the correct book in the aggregation:

    (df
     .with_columns(
         # count rows per (user, book) pair with a window expression
         pl.col("book").count().over("user", "book").alias("book_count")
     )
     .group_by("user")
     .agg(
         pl.col("book").gather(pl.col("book_count").arg_max())
     )
    )
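
    As for the rank() formulation the question asks about: you can compute the counts first and then dense-rank them per user with a window expression, much like the SQL dense_rank. A sketch, assuming a Polars version where Expr.rank supports method="dense" and descending=True (with ties, every top-ranked book is kept):

    (df.group_by("user", "book")
     .agg(pl.col("book").count().alias("book_count"))
     .with_columns(
         # dense-rank the counts within each user, highest count = rank 1
         pl.col("book_count")
           .rank(method="dense", descending=True)
           .over("user")
           .alias("rank")
     )
     .filter(pl.col("rank") == 1)
     .select("user", pl.col("book").alias("fav_book"))
    )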