I have to extract information from the statsmodels OLS summary. While doing so, I noticed that the kurtosis reported in the summary differs from the result of the Series method kurtosis().
Here is the code:
from sklearn.datasets import load_diabetes
import pandas as pd
import statsmodels.api as sm
dic = load_diabetes()
df = pd.DataFrame(data=dic.data, columns=dic.feature_names)
y = dic.target
# %%
X = sm.add_constant(df)
model = sm.OLS(y, X)
res = model.fit()
print(res.summary2())
print(f'\n\nKurtosis by Array Method: {res.resid.kurtosis():.3f}')
print(f'Skewness by Array Method: {res.resid.skew():.3f}')
Output:
"""
Results: Ordinary least squares
==================================================================
Model: OLS Adj. R-squared: 0.507
Dependent Variable: y AIC: 4793.9857
Date: 2023-10-20 16:26 BIC: 4838.9901
No. Observations: 442 Log-Likelihood: -2386.0
Df Model: 10 F-statistic: 46.27
Df Residuals: 431 Prob (F-statistic): 3.83e-62
R-squared: 0.518 Scale: 2932.7
-------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
-------------------------------------------------------------------
const 152.1335 2.5759 59.0614 0.0000 147.0707 157.1963
age -10.0099 59.7492 -0.1675 0.8670 -127.4460 107.4263
sex -239.8156 61.2223 -3.9171 0.0001 -360.1471 -119.4841
bmi 519.8459 66.5334 7.8133 0.0000 389.0755 650.6163
bp 324.3846 65.4220 4.9583 0.0000 195.7988 452.9705
s1 -792.1756 416.6799 -1.9012 0.0579 -1611.1530 26.8017
s2 476.7390 339.0305 1.4062 0.1604 -189.6198 1143.0978
s3 101.0433 212.5315 0.4754 0.6347 -316.6838 518.7703
s4 177.0632 161.4758 1.0965 0.2735 -140.3147 494.4412
s5 751.2737 171.9000 4.3704 0.0000 413.4072 1089.1402
s6 67.6267 65.9843 1.0249 0.3060 -62.0643 197.3177
------------------------------------------------------------------
Omnibus: 1.506 Durbin-Watson: 2.029
Prob(Omnibus): 0.471 Jarque-Bera (JB): 1.404
Skew: 0.017 Prob(JB): 0.496
Kurtosis: 2.726 Condition No.: 227
==================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the
errors is correctly specified.
Kurtosis by Array Method: -0.264
Skewness by Array Method: 0.017
"""
I want to know which of the results is more reliable, and, if I have to use the summary result, how to extract it. I'm also printing the skewness by the array method to check whether my approach is correct or I'm doing something wrong.
I tried using the scipy.stats function, but the result is similar, not equal, to the array method (-0.274).
This is the difference between Pearson kurtosis and Fisher (or excess) kurtosis. According to Wikipedia:
It is common practice to use excess kurtosis, which is defined as Pearson's kurtosis minus 3, to provide a simple comparison to the normal distribution.
When you subtract 3 from the kurtosis value in the summary, you obtain the same value as with scipy.stats.kurtosis: 2.726 - 3 = -0.274.
In fact, the function scipy.stats.kurtosis has an option fisher, which is True by default but can be set to False to get the same result as in the summary:
from scipy.stats import kurtosis
kurtosis(res.resid) # gives -0.2740841793704205
kurtosis(res.resid, fisher=False) # gives +2.7259158206295795
So, my suggestion would be to use scipy.stats.kurtosis, because it lets you choose explicitly which definition of kurtosis you want.
The pandas method res.resid.kurtosis() also computes Fisher (excess) kurtosis, but it applies a small-sample bias correction (the adjusted Fisher-Pearson estimator), which is why it gives a slightly different value.
I would trust SciPy.