Computing cross-sectional rankings using a tidy polars dataframe

I need to compute cross-sectional rankings across a number of trading securities. Consider the following pl.DataFrame in long (tidy) format. It comprises three different symbols with respective prices, where each symbol also has a dedicated (i.e. local) trading calendar.

df = pl.DataFrame(
    {
        "symbol": [*["symbol1"] * 6, *["symbol2"] * 5, *["symbol3"] * 5],
        "date": [
            "2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
            "2023-12-30", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
            "2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05",
        ],
        "price": [
            100, 105, 110, 115, 120, 125,
            200, 210, 220, 230, 240,
            3000, 3100, 3200, 3300, 3400,
        ],
    }
)

print(df)
shape: (16, 3)
┌─────────┬────────────┬───────┐
│ symbol  ┆ date       ┆ price │
│ ---     ┆ ---        ┆ ---   │
│ str     ┆ str        ┆ i64   │
╞═════════╪════════════╪═══════╡
│ symbol1 ┆ 2023-12-30 ┆ 100   │
│ symbol1 ┆ 2023-12-31 ┆ 105   │
│ symbol1 ┆ 2024-01-03 ┆ 110   │
│ symbol1 ┆ 2024-01-04 ┆ 115   │
│ symbol1 ┆ 2024-01-05 ┆ 120   │
│ …       ┆ …          ┆ …     │
│ symbol3 ┆ 2023-12-30 ┆ 3000  │
│ symbol3 ┆ 2023-12-31 ┆ 3100  │
│ symbol3 ┆ 2024-01-03 ┆ 3200  │
│ symbol3 ┆ 2024-01-04 ┆ 3300  │
│ symbol3 ┆ 2024-01-05 ┆ 3400  │
└─────────┴────────────┴───────┘

The first step is to compute the periodic returns using pct_change and subsequently using pivot to align the symbols per date.

returns = df.drop_nulls().with_columns(
    pl.col("price").pct_change(n=2).over("symbol").alias("return")
).pivot(on="symbol", index="date", values="return")

print(returns)
shape: (6, 4)
┌────────────┬──────────┬──────────┬──────────┐
│ date       ┆ symbol1  ┆ symbol2  ┆ symbol3  │
│ ---        ┆ ---      ┆ ---      ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╪══════════╡
│ 2023-12-30 ┆ null     ┆ null     ┆ null     │
│ 2023-12-31 ┆ null     ┆ null     ┆ null     │
│ 2024-01-03 ┆ 0.1      ┆ null     ┆ 0.066667 │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1      ┆ 0.064516 │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625   │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null     │
└────────────┴──────────┴──────────┴──────────┘

The next step is to use concat_list to create a list to compute the ranks per row (descending, i.e. highest return gets rank 1).

ranks = (
    returns.with_columns(all_symbols=pl.concat_list(pl.all().exclude("date")))
    .select(
        pl.all().exclude("all_symbols"),
        pl.col("all_symbols")
        .list.eval(
            pl.element().rank(descending=True, method="ordinal").cast(pl.UInt8)
        )
        .alias("rank"),
    )
)

print(ranks)
shape: (6, 5)
┌────────────┬──────────┬──────────┬──────────┬────────────────────┐
│ date       ┆ symbol1  ┆ symbol2  ┆ symbol3  ┆ rank               │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---                │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ list[u8]           │
╞════════════╪══════════╪══════════╪══════════╪════════════════════╡
│ 2023-12-30 ┆ null     ┆ null     ┆ null     ┆ [null, null, null] │
│ 2023-12-31 ┆ null     ┆ null     ┆ null     ┆ [null, null, null] │
│ 2024-01-03 ┆ 0.1      ┆ null     ┆ 0.066667 ┆ [1, null, 2]       │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1      ┆ 0.064516 ┆ [2, 1, 3]          │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625   ┆ [2, 1, 3]          │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null     ┆ [2, 1, null]       │
└────────────┴──────────┴──────────┴──────────┴────────────────────┘

Now we are finally getting to the actual question:
I would like to unpivot ranks again and produce a tidy dataframe. I am looking for the following columns: symbol, date, return, and rank. I was thinking about creating three new columns (basically using explode to unpack the list, but this will only create new rows rather than columns).

Also, I am wondering if I am required to pivot df in the first place or if there's a better way to directly operate on the original df in tidy format? I am actually looking for performance as df could have millions of rows.

Solution

Well you can simplify the process without the need of explode and to avoid the need to pivot and unpivot:

returns = df.drop_nulls().with_columns(
    pl.col("price").pct_change(n=2).over("symbol").alias("return")
)

shape: (16, 4)
┌─────────┬────────────┬───────┬──────────┐
│ symbol  ┆ date       ┆ price ┆ return   │
│ ---     ┆ ---        ┆ ---   ┆ ---      │
│ str     ┆ str        ┆ i64   ┆ f64      │
╞═════════╪════════════╪═══════╪══════════╡
│ symbol1 ┆ 2023-12-30 ┆ 100   ┆ null     │
│ symbol1 ┆ 2023-12-31 ┆ 105   ┆ null     │
│ symbol1 ┆ 2024-01-03 ┆ 110   ┆ 0.1      │
│ symbol1 ┆ 2024-01-04 ┆ 115   ┆ 0.095238 │
│ symbol1 ┆ 2024-01-05 ┆ 120   ┆ 0.090909 │
│ …       ┆ …          ┆ …     ┆ …        │
│ symbol3 ┆ 2023-12-30 ┆ 3000  ┆ null     │
│ symbol3 ┆ 2023-12-31 ┆ 3100  ┆ null     │
│ symbol3 ┆ 2024-01-03 ┆ 3200  ┆ 0.066667 │
│ symbol3 ┆ 2024-01-04 ┆ 3300  ┆ 0.064516 │
│ symbol3 ┆ 2024-01-05 ┆ 3400  ┆ 0.0625   │
└─────────┴────────────┴───────┴──────────┘

Next rank the return values:

ranked_returns = returns.with_columns(
    pl.col("return").rank(descending=True).over("date").cast(pl.UInt8).alias("rank")
)

shape: (16, 5)
┌─────────┬────────────┬───────┬──────────┬──────┐
│ symbol  ┆ date       ┆ price ┆ return   ┆ rank │
│ ---     ┆ ---        ┆ ---   ┆ ---      ┆ ---  │
│ str     ┆ str        ┆ i64   ┆ f64      ┆ u8   │
╞═════════╪════════════╪═══════╪══════════╪══════╡
│ symbol1 ┆ 2023-12-30 ┆ 100   ┆ null     ┆ null │
│ symbol1 ┆ 2023-12-31 ┆ 105   ┆ null     ┆ null │
│ symbol1 ┆ 2024-01-03 ┆ 110   ┆ 0.1      ┆ 1    │
│ symbol1 ┆ 2024-01-04 ┆ 115   ┆ 0.095238 ┆ 2    │
│ symbol1 ┆ 2024-01-05 ┆ 120   ┆ 0.090909 ┆ 2    │
│ …       ┆ …          ┆ …     ┆ …        ┆ …    │
│ symbol3 ┆ 2023-12-30 ┆ 3000  ┆ null     ┆ null │
│ symbol3 ┆ 2023-12-31 ┆ 3100  ┆ null     ┆ null │
│ symbol3 ┆ 2024-01-03 ┆ 3200  ┆ 0.066667 ┆ 2    │
│ symbol3 ┆ 2024-01-04 ┆ 3300  ┆ 0.064516 ┆ 3    │
│ symbol3 ┆ 2024-01-05 ┆ 3400  ┆ 0.0625   ┆ 3    │
└─────────┴────────────┴───────┴──────────┴──────┘

And select only the symbol, date, return, and rank columns:

tidy_df = ranked_returns.select(["symbol", "date", "return", "rank"])

shape: (16, 4)
┌─────────┬────────────┬──────────┬──────┐
│ symbol  ┆ date       ┆ return   ┆ rank │
│ ---     ┆ ---        ┆ ---      ┆ ---  │
│ str     ┆ str        ┆ f64      ┆ u8   │
╞═════════╪════════════╪══════════╪══════╡
│ symbol1 ┆ 2023-12-30 ┆ null     ┆ null │
│ symbol1 ┆ 2023-12-31 ┆ null     ┆ null │
│ symbol1 ┆ 2024-01-03 ┆ 0.1      ┆ 1    │
│ symbol1 ┆ 2024-01-04 ┆ 0.095238 ┆ 2    │
│ symbol1 ┆ 2024-01-05 ┆ 0.090909 ┆ 2    │
│ …       ┆ …          ┆ …        ┆ …    │
│ symbol3 ┆ 2023-12-30 ┆ null     ┆ null │
│ symbol3 ┆ 2023-12-31 ┆ null     ┆ null │
│ symbol3 ┆ 2024-01-03 ┆ 0.066667 ┆ 2    │
│ symbol3 ┆ 2024-01-04 ┆ 0.064516 ┆ 3    │
│ symbol3 ┆ 2024-01-05 ┆ 0.0625   ┆ 3    │
└─────────┴────────────┴──────────┴──────┘