Tags: numpy, scipy, statistics, linear-regression, statsmodels

Getting a different kurtosis from numpy array method than from Summary


I have to extract information from the statsmodels OLS summary. While doing so, I noticed that the kurtosis reported in the summary differs from the one returned by the array method kurtosis().

Here is the code:

from sklearn.datasets import load_diabetes
import pandas as pd
import statsmodels.api as sm


dic = load_diabetes()

df = pd.DataFrame(data=dic.data, columns=dic.feature_names)
y = dic.target

# %%

X = sm.add_constant(df)

model = sm.OLS(y, X)

res = model.fit()

print(res.summary2())
print(f'\n\nKurtosis by Array Method: {res.resid.kurtosis():.3f}')
print(f'Skewness by Array Method: {res.resid.skew():.3f}')

Output:

"""
 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:     0.507    
Dependent Variable: y                AIC:                4793.9857
Date:               2023-10-20 16:26 BIC:                4838.9901
No. Observations:   442              Log-Likelihood:     -2386.0  
Df Model:           10               F-statistic:        46.27    
Df Residuals:       431              Prob (F-statistic): 3.83e-62 
R-squared:          0.518            Scale:              2932.7   
-------------------------------------------------------------------
          Coef.    Std.Err.     t     P>|t|     [0.025      0.975] 
-------------------------------------------------------------------
const    152.1335    2.5759  59.0614  0.0000    147.0707   157.1963
age      -10.0099   59.7492  -0.1675  0.8670   -127.4460   107.4263
sex     -239.8156   61.2223  -3.9171  0.0001   -360.1471  -119.4841
bmi      519.8459   66.5334   7.8133  0.0000    389.0755   650.6163
bp       324.3846   65.4220   4.9583  0.0000    195.7988   452.9705
s1      -792.1756  416.6799  -1.9012  0.0579  -1611.1530    26.8017
s2       476.7390  339.0305   1.4062  0.1604   -189.6198  1143.0978
s3       101.0433  212.5315   0.4754  0.6347   -316.6838   518.7703
s4       177.0632  161.4758   1.0965  0.2735   -140.3147   494.4412
s5       751.2737  171.9000   4.3704  0.0000    413.4072  1089.1402
s6        67.6267   65.9843   1.0249  0.3060    -62.0643   197.3177
------------------------------------------------------------------
Omnibus:               1.506        Durbin-Watson:           2.029
Prob(Omnibus):         0.471        Jarque-Bera (JB):        1.404
Skew:                  0.017        Prob(JB):                0.496
Kurtosis:              2.726        Condition No.:           227  
==================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the
errors is correctly specified.


Kurtosis by Array Method: -0.264
Skewness by Array Method: 0.017
"""

I want to know which of the two results is more reliable and, if I have to use the summary result, how to extract it. I'm also printing the skewness by the array method to check whether my approach is correct or whether I'm doing something wrong.

I also tried the scipy.stats kurtosis function, but its result (-0.274) is close to, yet not equal to, the array method's.


Solution

  • This seems to be the difference between the Pearson kurtosis and Fisher (or excess) kurtosis. According to Wikipedia:

    It is common practice to use excess kurtosis, which is defined as Pearson's kurtosis minus 3, to provide a simple comparison to the normal distribution.

    When you subtract 3 from the kurtosis value in the summary, you obtain the same value as with scipy.stats.kurtosis. In fact, the function scipy.stats.kurtosis has an option fisher, which is True by default, but can be set to False to get the same result as in the summary:

    from scipy.stats import kurtosis
    kurtosis(res.resid)                # gives -0.2740841793704205
    kurtosis(res.resid, fisher=False)  # gives +2.7259158206295795
    

    So, my suggestion would be to use scipy.stats.kurtosis, because it lets you choose explicitly which definition of kurtosis you want.
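
    If you specifically need the figure printed in the summary, you don't
    have to parse the summary text: statsmodels exposes the same diagnostics
    through statsmodels.stats.stattools.jarque_bera, which, as far as I can
    tell, is what the summary's bottom row is computed from. A minimal
    sketch, reusing res from the question's code:

    from statsmodels.stats.stattools import jarque_bera

    # Returns the JB statistic, its p-value, the skew, and the (Pearson)
    # kurtosis, i.e. the same four numbers shown in the diagnostics rows.
    jb, jb_pvalue, skew, kurt = jarque_bera(res.resid)
    print(f'Kurtosis (as in summary): {kurt:.3f}')  # 2.726
    print(f'Skew (as in summary):     {skew:.3f}')  # 0.017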

    The pandas method res.resid.kurtosis() also computes Fisher (excess)
    kurtosis, but pandas applies a small-sample bias correction (the
    adjusted Fisher-Pearson estimator), whereas scipy's default, bias=True,
    uses the uncorrected sample moments; that is the source of the small
    discrepancy (-0.264 vs. -0.274). The same distinction applies to
    skew(), but with 442 observations the correction is too small to show
    at three decimal places, which is why your skewness values agreed.
    I would still trust scipy here, since it makes both choices explicit.
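
    You can check this by asking scipy for the bias-corrected estimator,
    which should reproduce the pandas value (again reusing res from the
    question's code):

    from scipy.stats import kurtosis

    # scipy's default, bias=True, uses the uncorrected sample moments;
    # bias=False applies the same small-sample correction as pandas.
    kurtosis(res.resid, fisher=True, bias=False)  # ~ -0.264, matches res.resid.kurtosis()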