Search code examples
pythonpython-polars

Computing cross-sectional rankings using a tidy polars dataframe


I need to compute cross-sectional rankings across a number of trading securities. Consider the following pl.DataFrame in long (tidy) format. It comprises three different symbols with respective prices, where each symbol also has a dedicated (i.e. local) trading calendar.

df = pl.DataFrame(
    {
        "symbol": [*["symbol1"] * 6, *["symbol2"] * 5, *["symbol3"] * 5],
        "date": [
            "2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
            "2023-12-30", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06",
            "2023-12-30", "2023-12-31", "2024-01-03", "2024-01-04", "2024-01-05",
        ],
        "price": [
            100, 105, 110, 115, 120, 125,
            200, 210, 220, 230, 240,
            3000, 3100, 3200, 3300, 3400,
        ],
    }
)

print(df)
shape: (16, 3)
┌─────────┬────────────┬───────┐
│ symbol  ┆ date       ┆ price │
│ ---     ┆ ---        ┆ ---   │
│ str     ┆ str        ┆ i64   │
╞═════════╪════════════╪═══════╡
│ symbol1 ┆ 2023-12-30 ┆ 100   │
│ symbol1 ┆ 2023-12-31 ┆ 105   │
│ symbol1 ┆ 2024-01-03 ┆ 110   │
│ symbol1 ┆ 2024-01-04 ┆ 115   │
│ symbol1 ┆ 2024-01-05 ┆ 120   │
│ …       ┆ …          ┆ …     │
│ symbol3 ┆ 2023-12-30 ┆ 3000  │
│ symbol3 ┆ 2023-12-31 ┆ 3100  │
│ symbol3 ┆ 2024-01-03 ┆ 3200  │
│ symbol3 ┆ 2024-01-04 ┆ 3300  │
│ symbol3 ┆ 2024-01-05 ┆ 3400  │
└─────────┴────────────┴───────┘

The first step is to compute the periodic returns using pct_change and subsequently using pivot to align the symbols per date.

returns = df.drop_nulls().with_columns(
    pl.col("price").pct_change(n=2).over("symbol").alias("return")
).pivot(on="symbol", index="date", values="return")

print(returns)
shape: (6, 4)
┌────────────┬──────────┬──────────┬──────────┐
│ date       ┆ symbol1  ┆ symbol2  ┆ symbol3  │
│ ---        ┆ ---      ┆ ---      ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╪══════════╡
│ 2023-12-30 ┆ null     ┆ null     ┆ null     │
│ 2023-12-31 ┆ null     ┆ null     ┆ null     │
│ 2024-01-03 ┆ 0.1      ┆ null     ┆ 0.066667 │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1      ┆ 0.064516 │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625   │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null     │
└────────────┴──────────┴──────────┴──────────┘

The next step is to use concat_list to create a list to compute the ranks per row (descending, i.e. highest return gets rank 1).

ranks = (
    returns.with_columns(all_symbols=pl.concat_list(pl.all().exclude("date")))
    .select(
        pl.all().exclude("all_symbols"),
        pl.col("all_symbols")
        .list.eval(
            pl.element().rank(descending=True, method="ordinal").cast(pl.UInt8)
        )
        .alias("rank"),
    )
)

print(ranks)
shape: (6, 5)
┌────────────┬──────────┬──────────┬──────────┬────────────────────┐
│ date       ┆ symbol1  ┆ symbol2  ┆ symbol3  ┆ rank               │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---                │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ list[u8]           │
╞════════════╪══════════╪══════════╪══════════╪════════════════════╡
│ 2023-12-30 ┆ null     ┆ null     ┆ null     ┆ [null, null, null] │
│ 2023-12-31 ┆ null     ┆ null     ┆ null     ┆ [null, null, null] │
│ 2024-01-03 ┆ 0.1      ┆ null     ┆ 0.066667 ┆ [1, null, 2]       │
│ 2024-01-04 ┆ 0.095238 ┆ 0.1      ┆ 0.064516 ┆ [2, 1, 3]          │
│ 2024-01-05 ┆ 0.090909 ┆ 0.095238 ┆ 0.0625   ┆ [2, 1, 3]          │
│ 2024-01-06 ┆ 0.086957 ┆ 0.090909 ┆ null     ┆ [2, 1, null]       │
└────────────┴──────────┴──────────┴──────────┴────────────────────┘

Now we are finally getting to the actual question:
I would like to unpivot ranks again and produce a tidy dataframe. I am looking for the following columns: symbol, date, return, and rank. I was thinking about creating three new columns (basically using explode to unpack the list, but this will only create new rows rather than columns).

Also, I am wondering if I am required to pivot df in the first place or if there's a better way to directly operate on the original df in tidy format? I am actually looking for performance as df could have millions of rows.


Solution

  • Well you can simplify the process without the need of explode and to avoid the need to pivot and unpivot:

    returns = df.drop_nulls().with_columns(
        pl.col("price").pct_change(n=2).over("symbol").alias("return")
    )
    
    shape: (16, 4)
    ┌─────────┬────────────┬───────┬──────────┐
    │ symbol  ┆ date       ┆ price ┆ return   │
    │ ---     ┆ ---        ┆ ---   ┆ ---      │
    │ str     ┆ str        ┆ i64   ┆ f64      │
    ╞═════════╪════════════╪═══════╪══════════╡
    │ symbol1 ┆ 2023-12-30 ┆ 100   ┆ null     │
    │ symbol1 ┆ 2023-12-31 ┆ 105   ┆ null     │
    │ symbol1 ┆ 2024-01-03 ┆ 110   ┆ 0.1      │
    │ symbol1 ┆ 2024-01-04 ┆ 115   ┆ 0.095238 │
    │ symbol1 ┆ 2024-01-05 ┆ 120   ┆ 0.090909 │
    │ …       ┆ …          ┆ …     ┆ …        │
    │ symbol3 ┆ 2023-12-30 ┆ 3000  ┆ null     │
    │ symbol3 ┆ 2023-12-31 ┆ 3100  ┆ null     │
    │ symbol3 ┆ 2024-01-03 ┆ 3200  ┆ 0.066667 │
    │ symbol3 ┆ 2024-01-04 ┆ 3300  ┆ 0.064516 │
    │ symbol3 ┆ 2024-01-05 ┆ 3400  ┆ 0.0625   │
    └─────────┴────────────┴───────┴──────────┘
    

    Next rank the return values:

    ranked_returns = returns.with_columns(
        pl.col("return").rank(descending=True).over("date").cast(pl.UInt8).alias("rank")
    )
    
    shape: (16, 5)
    ┌─────────┬────────────┬───────┬──────────┬──────┐
    │ symbol  ┆ date       ┆ price ┆ return   ┆ rank │
    │ ---     ┆ ---        ┆ ---   ┆ ---      ┆ ---  │
    │ str     ┆ str        ┆ i64   ┆ f64      ┆ u8   │
    ╞═════════╪════════════╪═══════╪══════════╪══════╡
    │ symbol1 ┆ 2023-12-30 ┆ 100   ┆ null     ┆ null │
    │ symbol1 ┆ 2023-12-31 ┆ 105   ┆ null     ┆ null │
    │ symbol1 ┆ 2024-01-03 ┆ 110   ┆ 0.1      ┆ 1    │
    │ symbol1 ┆ 2024-01-04 ┆ 115   ┆ 0.095238 ┆ 2    │
    │ symbol1 ┆ 2024-01-05 ┆ 120   ┆ 0.090909 ┆ 2    │
    │ …       ┆ …          ┆ …     ┆ …        ┆ …    │
    │ symbol3 ┆ 2023-12-30 ┆ 3000  ┆ null     ┆ null │
    │ symbol3 ┆ 2023-12-31 ┆ 3100  ┆ null     ┆ null │
    │ symbol3 ┆ 2024-01-03 ┆ 3200  ┆ 0.066667 ┆ 2    │
    │ symbol3 ┆ 2024-01-04 ┆ 3300  ┆ 0.064516 ┆ 3    │
    │ symbol3 ┆ 2024-01-05 ┆ 3400  ┆ 0.0625   ┆ 3    │
    └─────────┴────────────┴───────┴──────────┴──────┘
    

    And select only the symbol, date, return, and rank columns:

    tidy_df = ranked_returns.select(["symbol", "date", "return", "rank"])
    
    shape: (16, 4)
    ┌─────────┬────────────┬──────────┬──────┐
    │ symbol  ┆ date       ┆ return   ┆ rank │
    │ ---     ┆ ---        ┆ ---      ┆ ---  │
    │ str     ┆ str        ┆ f64      ┆ u8   │
    ╞═════════╪════════════╪══════════╪══════╡
    │ symbol1 ┆ 2023-12-30 ┆ null     ┆ null │
    │ symbol1 ┆ 2023-12-31 ┆ null     ┆ null │
    │ symbol1 ┆ 2024-01-03 ┆ 0.1      ┆ 1    │
    │ symbol1 ┆ 2024-01-04 ┆ 0.095238 ┆ 2    │
    │ symbol1 ┆ 2024-01-05 ┆ 0.090909 ┆ 2    │
    │ …       ┆ …          ┆ …        ┆ …    │
    │ symbol3 ┆ 2023-12-30 ┆ null     ┆ null │
    │ symbol3 ┆ 2023-12-31 ┆ null     ┆ null │
    │ symbol3 ┆ 2024-01-03 ┆ 0.066667 ┆ 2    │
    │ symbol3 ┆ 2024-01-04 ┆ 0.064516 ┆ 3    │
    │ symbol3 ┆ 2024-01-05 ┆ 0.0625   ┆ 3    │
    └─────────┴────────────┴──────────┴──────┘