I am using polars in place of pandas. I am quite amazed by the speed and lazy computation/evaluation. Right now, there are a lot of methods on lazy dataframe, but they can only drive me so far.
So, I am wondering what is the best way to use polars in combination with other tools to achieve more complicated operations, such as regression/model fitting.
To be more specific, I will give an example involving linear regression.
Assume I have a polars dataframe with columns day, y, x1 and x2, and I want to generate a series, which is the residual of regressing y on x1 and x2 group by day. I have included the code example as follows and how it can be solved using pandas and statsmodels. How can I get the same result with the most efficiency using idiomatic polars?
import pandas as pd
import statsmodels.api as sm
def regress_resid(df, yvar, xvars):
result = sm.OLS(df[yvar], sm.add_constant(df[xvars])).fit()
return result.resid
df = pd.DataFrame(
{
"day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
"x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
"x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
}
)
df.groupby("day").apply(regress_resid, "y", ["x1", "x2"])
# day
# 1 0 0.772431
# 1 -0.689233
# 2 -1.167210
# 3 -0.827896
# 4 1.911909
# 2 5 -0.851691
# 6 1.719451
# 7 -1.167727
# 8 0.354871
# 9 -0.054905
Thanks for your help.
If you want to pass multiple columns to a function, you have to pack them into a Struct
as polars expression always map from Series -> Series
.
Because polars does not use numpy
memory which statsmodels
does, you must convert the polars types to_numpy
. This is often free in case of 1D structures.
Finally, the function should not return a numpy array, but a polars Series
instead, so we convert the result.
import polars as pl
from functools import partial
import statsmodels.api as sm
def regress_resid(s: pl.Series, yvar: str, xvars: list[str]) -> pl.Series:
df = s.struct.unnest()
yvar = df[yvar].to_numpy()
xvars = df[xvars].to_numpy()
result = sm.OLS(yvar, sm.add_constant(xvars)).fit()
return pl.Series(result.resid)
df = pl.DataFrame(
{
"day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
"x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
"x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
}
)
(df.group_by("day")
.agg(
pl.struct("y", "x1", "x2").map_elements(partial(regress_resid, yvar="y", xvars=["x1", "x2"]))
)
)