Search code examples
pythonpython-polars

How to return multiple stats as multiple columns in Polars grouby context?


The task at hand is to do multiple linear regression over multiple columns in groupby context and return respective beta coefficients and their associated t-values in separate columns.

Below is an illustration of an attempt to do such using statsmodels.

import numpy as np
import polars as pl
import statsmodels.api as sm

from functools import partial


def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    yvar = df[yvar].to_numpy()
    xvars = df[xvars].to_numpy()
    reg = sm.OLS(yvar, sm.add_constant(xvars), missing="drop").fit()
    return np.concatenate((reg.params, reg.tvalues))



df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3, 1],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3, 1],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1, 1],
    }
)

df.group_by("day").agg(
    pl.struct("y", "x1", "x2")
    .map_elements(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

The result from the code snippet above evaluates to

shape: (3, 2)
┌─────┬─────────────────────────────────┐
│ day ┆ params                          │
│ --- ┆ ---                             │
│ i64 ┆ object                          │
╞═════╪═════════════════════════════════╡
│ 2   ┆ [2.0462002  0.22397054 0.33679… │
│ 1   ┆ [ 4.86623165  0.64029364 -0.65… │
│ 3   ┆ [0.5 0.5 0.  0. ]               │
└─────┴─────────────────────────────────┘

How am I supposed to split the 'params' into separate columns with one scalar value in each column?

Also, my code seems to fail at some corner cases. Below is one of them.

df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3, None],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3, 1],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1, 1],
    }
)

df.group_by("day").agg(
    pl.struct("y", "x1", "x2")
    .map_elements(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

# ComputeError: ValueError: exog is not 1d or 2d

How can I make the code robust to such case?

Thanks for your help. And feel free to suggest your own solution.


Solution

  • Note that params column ends up with the object dtype - this is not what you want.

    You can specify the return_dtype to avoid this.

    (df.group_by("day")
       .agg(
          pl.struct("y", "x1", "x2").map_elements(
             partial(ols_stats, yvar="y", xvars=["x1", "x2"]),
             return_dtype = pl.List(pl.Float64)
          )
          .alias("params")
       )
    )
    
    shape: (3, 2)
    ┌─────┬─────────────────────────────────┐
    │ day ┆ params                          │
    │ --- ┆ ---                             │
    │ i64 ┆ list[f64]                       │ # list[f64] ✓
    ╞═════╪═════════════════════════════════╡
    │ 1   ┆ [4.866232, 0.640294, … -1.4306… │
    │ 2   ┆ [2.0462, 0.223971, … 1.091109]  │
    │ 3   ┆ [0.5, 0.5, … 0.0]               │
    └─────┴─────────────────────────────────┘
    

    One way to convert to columns is with .list.to_struct() which you can then .unnest()

       .with_columns(pl.col("params").list.to_struct("max_width"))
       .unnest("params")
    )
    
    shape: (3, 7)
    ┌─────┬──────────┬──────────┬───────────┬──────────┬──────────┬───────────┐
    │ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  ┆ field_4  ┆ field_5   │
    │ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---       │
    │ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      ┆ f64      ┆ f64       │
    ╞═════╪══════════╪══════════╪═══════════╪══════════╪══════════╪═══════════╡
    │ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 ┆ 0.495378 ┆ 1.091109  │
    │ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      ┆ null     ┆ null      │
    │ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 ┆ 1.81586  ┆ -1.430613 │
    └─────┴──────────┴──────────┴───────────┴──────────┴──────────┴───────────┘