Search code examples
pythonpython-polars

Polars dataframe: add columns conditional on other column yielding different lengths


I have the following dataframe in polars.

df = pl.DataFrame({
    "Buy_Signal": [1, 0, 1, 0, 0], 
    "Returns": np.random.normal(0, 0.1, 5),
})

Ultimately, I want to do aggregations on column Returns conditional on different intervals - which are given by column Buy_Signal. In the above case the length is from each 1 to the end of the dataframe.

What is the most polar way to do this?

My "stupid" solution (before I can apply aggregations) is as follows:

df = (df
      .with_columns(pl.col("Buy_Signal").cum_sum().alias("Holdings"))

      # (1) Determine returns for each holding period (>=1, >=2) and unpivot into one column
      .with_columns(
            pl.when(pl.col("Holdings") >= 1).then(pl.col("Returns")).alias("Port_1"),
            pl.when(pl.col("Holdings") >= 2).then(pl.col("Returns")).alias("Port_2"),
            )
      ) 

The solution is obviously not working for many "buy signals". So I wrote a separate function calc_port_returns which includes a for loop which is then passed to polar's pipe function. See here:

def calc_port_returns(_df: pl.DataFrame) -> pl.DataFrame:
    n_portfolios = _df["Buy_Signal"].sum()
    data = pl.DataFrame()
    _df = _df.with_columns(pl.col("Buy_Signal").cum_sum().alias("Holdings"))
    for i in range(1, n_portfolios + 1):
        tmp = (
                _df.with_columns(
                pl.when(pl.col("Holdings") >= i).then(pl.col("Returns")).alias(f"Port_{i}"),
                )
                .select(pl.col(f"Port_{i}"))
                )
        data = pl.concat([data, tmp], how="horizontal")
    _df = pl.concat([_df, data], how="horizontal")
    return _df

df = pl.DataFrame({
    "Buy_Signal": [1, 0, 1, 0, 0], 
    "Returns": np.random.normal(0, 0.1, 5),
})

df.pipe(calc_port_returns)

What is the "polars" way to do this? In pandas I could imagine solving it using df.assign({f"Port_{i}": ... for i in range(1, ...)}) with a few prior extra columns / side calculations.

Thanks for any suggestions.


Solution

  • It's pretty similar to what you were thinking...

    (
        df
        .with_columns(
            Holdings = pl.col('Buy_Signal').cum_sum()
        )
        .with_columns(
            pl.when(pl.col("Holdings")>=x)
            .then(pl.col("Returns")).alias(f"Port_{x}")
            for x in range(1,df['Buy_Signal'].sum()+1)
            )
    )
    shape: (5, 5)
    ┌────────────┬───────────┬──────────┬───────────┬───────────┐
    │ Buy_Signal ┆ Returns   ┆ Holdings ┆ Port_1    ┆ Port_2    │
    │ ---        ┆ ---       ┆ ---      ┆ ---       ┆ ---       │
    │ i64        ┆ f64       ┆ i64      ┆ f64       ┆ f64       │
    ╞════════════╪═══════════╪══════════╪═══════════╪═══════════╡
    │ 1          ┆ -0.05515  ┆ 1        ┆ -0.05515  ┆ null      │
    │ 0          ┆ 0.205705  ┆ 1        ┆ 0.205705  ┆ null      │
    │ 1          ┆ -0.068856 ┆ 2        ┆ -0.068856 ┆ -0.068856 │
    │ 0          ┆ -0.141231 ┆ 2        ┆ -0.141231 ┆ -0.141231 │
    │ 0          ┆ -0.028524 ┆ 2        ┆ -0.028524 ┆ -0.028524 │
    └────────────┴───────────┴──────────┴───────────┴───────────┘