I need to compute cross-sectional rankings across a number of trading securities. Consider the following pl.DataFrame
in long (tidy) format. It comprises three different symbols with respective prices, where each symbol also has a dedicated (i.e. local) trading calendar.
df = pl.DataFrame(
{
"symbol": [*["symbol1"] * 6, *["symbol2"] * 5, *["symbol3"] * 5],
"date": [
"2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
"2023-12-30", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
"2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05",
],
"price": [
100, 105, 110, 115, 120, 125,
200, 210, 220, 230, 240,
3000, 3100, 3200, 3300, 3400,
],
}
)
print(df)
shape: (16, 3)
┌─────────┬────────────┬───────┐
│ symbol ┆ date ┆ price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪════════════╪═══════╡
│ symbol1 ┆ 2023-12-30 ┆ 100 │
│ symbol1 ┆ 2023-12-31 ┆ 105 │
│ symbol1 ┆ 2024-01-03 ┆ 110 │
│ symbol1 ┆ 2024-01-04 ┆ 115 │
│ symbol1 ┆ 2024-01-05 ┆ 120 │
│ … ┆ … ┆ … │
│ symbol3 ┆ 2023-12-30 ┆ 3000 │
│ symbol3 ┆ 2023-12-31 ┆ 3100 │
│ symbol3 ┆ 2024-01-03 ┆ 3200 │
│ symbol3 ┆ 2024-01-04 ┆ 3300 │
│ symbol3 ┆ 2024-01-05 ┆ 3400 │
└─────────┴────────────┴───────┘
The first step is to compute the periodic returns using pct_change
and subsequently using pivot
to align the symbols per date.
returns = df.drop_nulls().with_columns(
pl.col("price").pct_change(n=2).over("symbol").alias("return")
).pivot(on="symbol", index="date", values="return")
print(returns)
shape: (6, 4)
┌────────────┬──────────┬──────────┬──────────┐
│ date ┆ symbol1 ┆ symbol2 ┆ symbol3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪══════════╪══════════╪══════════╡
│ 2023-12-30 ┆ null ┆ null ┆ null │
│ 2023-12-31 ┆ null ┆ null ┆ null │
│ 2024-01-03 ┆ 0.1 ┆ null ┆ 0.066667 │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1 ┆ 0.064516 │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625 │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null │
└────────────┴──────────┴──────────┴──────────┘
The next step is to use concat_list
to create a list
to compute the ranks per row (descending, i.e. highest return gets rank 1).
ranks = (
returns.with_columns(all_symbols=pl.concat_list(pl.all().exclude("date")))
.select(
pl.all().exclude("all_symbols"),
pl.col("all_symbols")
.list.eval(
pl.element().rank(descending=True, method="ordinal").cast(pl.UInt8)
)
.alias("rank"),
)
)
print(ranks)
shape: (6, 5)
┌────────────┬──────────┬──────────┬──────────┬────────────────────┐
│ date ┆ symbol1 ┆ symbol2 ┆ symbol3 ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ list[u8] │
╞════════════╪══════════╪══════════╪══════════╪════════════════════╡
│ 2023-12-30 ┆ null ┆ null ┆ null ┆ [null, null, null] │
│ 2023-12-31 ┆ null ┆ null ┆ null ┆ [null, null, null] │
│ 2024-01-03 ┆ 0.1 ┆ null ┆ 0.066667 ┆ [1, null, 2] │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1 ┆ 0.064516 ┆ [2, 1, 3] │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625 ┆ [2, 1, 3] │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null ┆ [2, 1, null] │
└────────────┴──────────┴──────────┴──────────┴────────────────────┘
Now we are finally getting to the actual question:
I would like to unpivot ranks
again and produce a tidy dataframe. I am looking for the following columns: symbol
, date
, return
, and rank
. I was thinking about creating three new columns (basically using explode
to unpack the list, but this will only create new rows rather than columns).
Also, I am wondering if I am required to pivot df
in the first place or if there's a better way to directly operate on the original df
in tidy format? I am actually looking for performance as df
could have millions of rows.
Well you can simplify the process without the need of explode
and to avoid the need to pivot and unpivot:
returns = df.drop_nulls().with_columns(
pl.col("price").pct_change(n=2).over("symbol").alias("return")
)
shape: (16, 4)
┌─────────┬────────────┬───────┬──────────┐
│ symbol ┆ date ┆ price ┆ return │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 │
╞═════════╪════════════╪═══════╪══════════╡
│ symbol1 ┆ 2023-12-30 ┆ 100 ┆ null │
│ symbol1 ┆ 2023-12-31 ┆ 105 ┆ null │
│ symbol1 ┆ 2024-01-03 ┆ 110 ┆ 0.1 │
│ symbol1 ┆ 2024-01-04 ┆ 115 ┆ 0.095238 │
│ symbol1 ┆ 2024-01-05 ┆ 120 ┆ 0.090909 │
│ … ┆ … ┆ … ┆ … │
│ symbol3 ┆ 2023-12-30 ┆ 3000 ┆ null │
│ symbol3 ┆ 2023-12-31 ┆ 3100 ┆ null │
│ symbol3 ┆ 2024-01-03 ┆ 3200 ┆ 0.066667 │
│ symbol3 ┆ 2024-01-04 ┆ 3300 ┆ 0.064516 │
│ symbol3 ┆ 2024-01-05 ┆ 3400 ┆ 0.0625 │
└─────────┴────────────┴───────┴──────────┘
Next rank the return values:
ranked_returns = returns.with_columns(
pl.col("return").rank(descending=True).over("date").cast(pl.UInt8).alias("rank")
)
shape: (16, 5)
┌─────────┬────────────┬───────┬──────────┬──────┐
│ symbol ┆ date ┆ price ┆ return ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 ┆ u8 │
╞═════════╪════════════╪═══════╪══════════╪══════╡
│ symbol1 ┆ 2023-12-30 ┆ 100 ┆ null ┆ null │
│ symbol1 ┆ 2023-12-31 ┆ 105 ┆ null ┆ null │
│ symbol1 ┆ 2024-01-03 ┆ 110 ┆ 0.1 ┆ 1 │
│ symbol1 ┆ 2024-01-04 ┆ 115 ┆ 0.095238 ┆ 2 │
│ symbol1 ┆ 2024-01-05 ┆ 120 ┆ 0.090909 ┆ 2 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ symbol3 ┆ 2023-12-30 ┆ 3000 ┆ null ┆ null │
│ symbol3 ┆ 2023-12-31 ┆ 3100 ┆ null ┆ null │
│ symbol3 ┆ 2024-01-03 ┆ 3200 ┆ 0.066667 ┆ 2 │
│ symbol3 ┆ 2024-01-04 ┆ 3300 ┆ 0.064516 ┆ 3 │
│ symbol3 ┆ 2024-01-05 ┆ 3400 ┆ 0.0625 ┆ 3 │
└─────────┴────────────┴───────┴──────────┴──────┘
And select only the symbol
, date
, return
, and rank
columns:
tidy_df = ranked_returns.select(["symbol", "date", "return", "rank"])
shape: (16, 4)
┌─────────┬────────────┬──────────┬──────┐
│ symbol ┆ date ┆ return ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ u8 │
╞═════════╪════════════╪══════════╪══════╡
│ symbol1 ┆ 2023-12-30 ┆ null ┆ null │
│ symbol1 ┆ 2023-12-31 ┆ null ┆ null │
│ symbol1 ┆ 2024-01-03 ┆ 0.1 ┆ 1 │
│ symbol1 ┆ 2024-01-04 ┆ 0.095238 ┆ 2 │
│ symbol1 ┆ 2024-01-05 ┆ 0.090909 ┆ 2 │
│ … ┆ … ┆ … ┆ … │
│ symbol3 ┆ 2023-12-30 ┆ null ┆ null │
│ symbol3 ┆ 2023-12-31 ┆ null ┆ null │
│ symbol3 ┆ 2024-01-03 ┆ 0.066667 ┆ 2 │
│ symbol3 ┆ 2024-01-04 ┆ 0.064516 ┆ 3 │
│ symbol3 ┆ 2024-01-05 ┆ 0.0625 ┆ 3 │
└─────────┴────────────┴──────────┴──────┘