
How to interpret the output of statsmodels model.summary() for multivariate linear regression?


I'm using the statsmodels library to check for the impact of confounding variables on a dependent variable by performing multivariate linear regression:

from statsmodels.formula.api import ols

model = ols(f'{metric}_diff ~ {" + ".join(confounding_variable_names)}', data=df).fit()

This is what my data looks like (only 2 rows pasted):

            Age     Sex  Experience using a gamepad (1-4)   Experience using a VR headset (1-4)  Experience using hand tracking (1-3)  Experience using controllers in VR (1-3) Glasses  ID_1  ID_2       Method_1       Method_2 ID_controller ID_handTracking  CorrectGestureCounter_controller  CorrectGestureCounter_handTracking  IncorrectGestureCounter_controller  IncorrectGestureCounter_handTracking
IDs                                                                                                                                                                                                                                                                                                                                                                                                          
ID_K_1_3     25  Female                                 4                                     3                                     1                                         2     Yes   K_1   K_3     controller   handTracking           K_1             K_3                                21                                  34                                   5                                     2
ID_K_4_5     19    Male                                 4                                     2                                     1                                         2     Yes   K_4   K_5     controller   handTracking           K_4             K_5                                21                                  36                                  14                                    17

When I execute model.summary() I get output like this:

                                OLS Regression Results                                
======================================================================================
Dep. Variable:     CorrectGestureCounter_diff   R-squared:                       0.477
Model:                                    OLS   Adj. R-squared:                  0.249
Method:                         Least Squares   F-statistic:                     2.088
Date:                        Wed, 28 Dec 2022   Prob (F-statistic):              0.105
Time:                                15:29:41   Log-Likelihood:                -73.565
No. Observations:                          24   AIC:                             163.1
Df Residuals:                              16   BIC:                             172.6
Df Model:                                   7                                         
Covariance Type:                    nonrobust                                         
==========================================================================================================
                                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
Intercept                                -24.6404      9.326     -2.642      0.018     -44.410      -4.871
Sex[T.Male]                               -7.3225      3.170     -2.310      0.035     -14.043      -0.602
Glasses[T.Yes]                            -2.4210      2.995     -0.808      0.431      -8.771       3.929
Age                                        0.2957      0.183      1.613      0.126      -0.093       0.684
Experience_using_a_gamepad_1_4             1.8810      1.853      1.015      0.325      -2.047       5.809
Experience_using_a_VR_headset_1_4          0.9559      3.213      0.297      0.770      -5.856       7.768
Experience_using_hand_tracking_1_3        -2.4689      3.633     -0.680      0.506     -10.170       5.232
Experience_using_controllers_in_VR_1_3     2.3592      4.840      0.487      0.633      -7.902      12.620
==============================================================================
Omnibus:                        0.621   Durbin-Watson:                   2.566
Prob(Omnibus):                  0.733   Jarque-Bera (JB):                0.702
Skew:                          -0.277   Prob(JB):                        0.704
Kurtosis:                       2.371   Cond. No.                         205.
==============================================================================

What do the [T.Male] and [T.Yes] next to Sex and Glasses mean, and how should I interpret them? Also, why is an Intercept added alongside my variables? Should I care about it in the context of confounding variables?


Solution

  • This is more of a stats question, but I'll do my best to help. A multivariate regression is of the form:

        Y = X·B + U

    where Y, B, and U are vectors holding the dependent variable, the coefficients, and the error terms respectively, and X is the design matrix that contains all of your predictor variables, such as Age, Glasses, etc. On to your question about the intercept: the above equation can be written out as:

        Y = β0 + β1·X1 + β2·X2 + ... + βk·Xk + U

    From this we can see that β0 ("beta naught") is an intercept that does not depend on any of your predictor variables: just like b in the basic slope formula y = mx + b, it is the intercept your regression is reporting. If all other terms are zero, your response variable starts at -24.6404. This is the base value of your regression, added to each and every prediction (see the first sketch after this answer).

    As for the other variables, i.e. Glasses and Sex, you have what is called a "dummy variable":

        X_Sex = I(Sex = Male),    X_Glasses = I(Glasses = Yes)

    where I(·) is an indicator function that equals 1 when the condition is true and 0 otherwise, so the X vectors corresponding to Sex and Glasses are binary vectors. In your example, Male ([T.Male]) is encoded as 1 and having glasses ([T.Yes]) as 1, while Female and no glasses are encoded as 0; the [T.Male] and [T.Yes] labels simply name the level the coefficient refers to, relative to the baseline (Female, No glasses). The interpretation is: if the person is male, add -7.3225; if they wear glasses, add -2.4210; otherwise add nothing (because anything times zero is zero). The second sketch after this answer shows this encoding explicitly.

    Hope that helped! I can't say much about your specific use case because I don't know exactly what statistical questions you have, but this is at least a quick crash course in understanding the output of your regression.
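
To make the intercept point concrete, here is a minimal sketch (the toy dataframe and column values below are made up for illustration, not your actual dataset) showing that the Intercept is the fitted value when every numeric predictor is zero and every categorical predictor is at its baseline level, and that it is included in every prediction:

import pandas as pd
from statsmodels.formula.api import ols

# Toy data, purely illustrative
toy = pd.DataFrame({
    "y":       [10, 12, 9, 15, 14, 11],
    "Age":     [20, 25, 22, 30, 28, 24],
    "Sex":     ["Female", "Male", "Female", "Male", "Male", "Female"],
    "Glasses": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

fit = ols("y ~ Age + Sex + Glasses", data=toy).fit()
print(fit.params)  # Intercept, Sex[T.Male], Glasses[T.Yes], Age

# Predicting for a "baseline" row (Age = 0, Female, no glasses)
# returns exactly the Intercept coefficient.
baseline = pd.DataFrame({"Age": [0], "Sex": ["Female"], "Glasses": ["No"]})
print(fit.predict(baseline))  # equals fit.params["Intercept"]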
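
And here is a second sketch of the treatment (dummy) coding that produces the [T.Male] and [T.Yes] labels. patsy is the library the statsmodels formula interface uses to build the design matrix; again, the toy dataframe is made up for illustration:

import pandas as pd
from patsy import dmatrix

toy = pd.DataFrame({
    "Sex":     ["Female", "Male", "Male", "Female"],
    "Glasses": ["Yes", "No", "Yes", "No"],
})

# Treatment coding: the first level in sorted order (Female, No) becomes the
# baseline and is absorbed into the Intercept; the other level gets a 0/1
# column named Sex[T.Male] / Glasses[T.Yes].
print(dmatrix("Sex + Glasses", data=toy, return_type="dataframe"))

# The same idea with plain pandas:
print(pd.get_dummies(toy, drop_first=True))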