Search code examples
pythonpython-polars

How to do regression (simple linear for example) in polars select or groupby context?


I am using polars in place of pandas. I am quite amazed by the speed and lazy computation/evaluation. Right now, there are a lot of methods on lazy dataframe, but they can only drive me so far.

So, I am wondering what is the best way to use polars in combination with other tools to achieve more complicated operations, such as regression/model fitting.

To be more specific, I will give an example involving linear regression.

Assume I have a polars dataframe with columns day, y, x1 and x2, and I want to generate a series, which is the residual of regressing y on x1 and x2 group by day. I have included the code example as follows and how it can be solved using pandas and statsmodels. How can I get the same result with the most efficiency using idiomatic polars?

import pandas as pd
import statsmodels.api as sm

def regress_resid(df, yvar, xvars):
    result = sm.OLS(df[yvar], sm.add_constant(df[xvars])).fit()
    return result.resid

df = pd.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
)

df.groupby("day").apply(regress_resid, "y", ["x1", "x2"])
# day
# 1    0    0.772431
#      1   -0.689233
#      2   -1.167210
#      3   -0.827896
#      4    1.911909
# 2    5   -0.851691
#      6    1.719451
#      7   -1.167727
#      8    0.354871
#      9   -0.054905

Thanks for your help.


Solution

  • If you want to pass multiple columns to a function, you have to pack them into a Struct as polars expression always map from Series -> Series.

    Because polars does not use numpy memory which statsmodels does, you must convert the polars types to_numpy. This is often free in case of 1D structures.

    Finally, the function should not return a numpy array, but a polars Series instead, so we convert the result.

    import polars as pl
    from functools import partial
    import statsmodels.api as sm
    
    def regress_resid(s: pl.Series, yvar: str, xvars: list[str]) -> pl.Series:
        df = s.struct.unnest()
        yvar = df[yvar].to_numpy()
        xvars = df[xvars].to_numpy()
        
        result = sm.OLS(yvar, sm.add_constant(xvars)).fit()
        return pl.Series(result.resid)
        
    
    df = pl.DataFrame(
        {
            "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
            "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
            "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
            "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
        }
    )
    
    (df.group_by("day")
       .agg(
           pl.struct("y", "x1", "x2").map_elements(partial(regress_resid, yvar="y", xvars=["x1", "x2"]))
       )
    )