python pandas statsmodels

Using statsmodels with a groupby


Consider this simple example:

import pandas as pd
import statsmodels.formula.api as sm

df = pd.DataFrame({'Y' : [1,2,3,4,5,6,7],
                   'X' : [2,3,4,5,6,3,2],
                   'group' : ['a','a','a','a','b','b','b']})

df
Out[99]: 
   Y  X group
0  1  2     a
1  2  3     a
2  3  4     a
3  4  5     a
4  5  6     b
5  6  3     b
6  7  2     b

I would like to run a regression by group. I have only found very old answers or solutions that use a loop. I just wonder why the very simple:

df.groupby('group').agg(lambda x: sm.ols(formula = 'Y ~ X', data = x))
PatsyError: Error evaluating factor: NameError: name 'X' is not defined
    Y ~ X

does not work. Can we do better with the latest version of pandas (1.2.3)? Thanks!


Solution

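  • You need to use apply instead of agg. agg calls the function on each column separately, so the formula receives a single Series and cannot find X, whereas apply passes each group's whole sub-DataFrame: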

    df.groupby('group').apply(lambda x: sm.ols(formula = 'Y ~ X', data = x))
    

    Output

    group
    a    <statsmodels.regression.linear_model.OLS objec...
    b    <statsmodels.regression.linear_model.OLS objec...
    dtype: object
    

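    You now have an OLS model object for every group. Note that these models are not fitted yet: sm.ols(...) only builds the model, so call .fit() on each one (or inside the lambda) to get the estimates.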
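
To go one step further and get the estimates, the fit can be done inside apply. This is only a minimal sketch, assuming the same df as in the question: fit_group is a hypothetical helper name, and selecting ['Y', 'X'] before apply is my own addition to keep the grouping column out of the data passed to the function.

```python
import pandas as pd
import statsmodels.formula.api as sm  # same alias as in the question

df = pd.DataFrame({'Y': [1, 2, 3, 4, 5, 6, 7],
                   'X': [2, 3, 4, 5, 6, 3, 2],
                   'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b']})


def fit_group(g):
    # Hypothetical helper: build the model for one group's sub-DataFrame,
    # fit it, and return the coefficients as a Series (Intercept, X).
    return sm.ols(formula='Y ~ X', data=g).fit().params


# Because fit_group returns a Series, apply stacks the results into a
# DataFrame with one row per group and one column per coefficient.
coefs = df.groupby('group')[['Y', 'X']].apply(fit_group)
print(coefs)
```

If you need more than the coefficients (standard errors, summary(), predictions), return the result of .fit() itself from the helper instead of .params; apply then gives a Series of fitted results objects indexed by group.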