Search code examples
pythonstatisticsstatsmodelsanova

how to compare hierarchical regression models in python?


I fitted two regression models, One with only 1 predictor and the another with 3 predictors. Now I want to compare these two models. How can I do that? I know how to do it in R but not sure how to do it in python. Here is the code in R for comparing the two models -

anova(albumSales.2, albumSales.3)

Result -

Model 1: sales ~ adverts
Model 2: sales ~ adverts + airplay + attract
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    198 862264                                  
2    196 434575  2    427690 96.447 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 

Based on the above result we can see that albumSales.3 significantly improved the fit of the model to the data compared to albumSales.2, F(2, 196) = 96.44, p < .001.

How can I do it in python?


Solution

  • In the anova, you basically calculate the difference in RSS. You can check more under the vignette for ANOVA in statsmodels:

    import pandas as pd
    import seaborn as sns
    import numpy as np
    
    iris = sns.load_dataset('iris')
    
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    
    iris.head()
    
        sepal_length    sepal_width petal_length    petal_width species
    0   5.1 3.5 1.4 0.2 setosa
    1   4.9 3.0 1.4 0.2 setosa
    2   4.7 3.2 1.3 0.2 setosa
    3   4.6 3.1 1.5 0.2 setosa
    4   5.0 3.6 1.4 0.2 setosa
    

    We run two models and do the anova:

    full_lm = ols("sepal_length ~ petal_length+petal_width", data=iris).fit()
    reduced_lm = ols("sepal_length ~ petal_length", data=iris).fit()
    anova_lm(reduced_lm,full_lm)
    
        df_resid    ssr df_diff ss_diff F   Pr(>F)
    0   148.0   24.525034   0.0 NaN NaN NaN
    1   147.0   23.880694   1.0 0.64434 3.9663  0.048272
    

    It throws some warning (you can see it on the website I linked above) because for the first row it cannot calculate the F etc.

    Note, this is different from calculating the Rsquare as proposed in the other answer. One important issue to note is that if you include more terms, your R-squared would theoretically increase, and you want to see whether the terms are significantly explaining additional variance, which is why you use an anova.