python · regression · statsmodels

What test (null hypothesis) does a model's `f_pvalue` correspond to?


What is the null hypothesis behind an OLSResults object's f_pvalue attribute? The docstring is not particularly useful.

At first I thought the null hypothesis was that all estimated coefficients are simultaneously zero (including the constant term). However, I am starting to think that the hypothesis being tested is that all estimated parameters except for the constant term are simultaneously zero (i.e. b1 = b2 = ... = bp = 0, excluding b0).

For example, suppose y is an array of targets and X is a NumPy array of regressors (a constant column plus p features).

# Silly example
from statsmodels.api import OLS
m = OLS(endog=y, exog=X).fit()

# What is being tested here?
print(m.f_pvalue)

Does anyone know what the null hypothesis is?


Solution

  • Thanks to @Josef for clearing things up. As per the documentation:

    F-statistic of the fully specified model.

    Calculated as the mean squared error of the model divided by the mean squared error of the residuals if the nonrobust covariance is used. Otherwise computed using a Wald-like quadratic form that tests whether all coefficients (excluding the constant) are zero.
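    The first part of that docstring can also be checked directly: with the default nonrobust covariance, fvalue equals mse_model / mse_resid, and f_pvalue is the upper tail probability of the corresponding F distribution with (df_model, df_resid) degrees of freedom. A minimal sketch on synthetic data (the generated X, y, and coefficients below are purely illustrative):

    ```python
    import numpy as np
    from scipy import stats
    from statsmodels.api import OLS, add_constant

    # Synthetic data: a constant plus 2 features (any dataset would do)
    rng = np.random.default_rng(0)
    X = add_constant(rng.normal(size=(100, 2)))
    y = X @ [1.0, 2.0, -1.0] + rng.normal(size=100)

    m = OLS(endog=y, exog=X).fit()

    # With the nonrobust covariance:
    # F = MSE_model / MSE_resid, on (df_model, df_resid) degrees of freedom
    f_manual = m.mse_model / m.mse_resid
    p_manual = stats.f.sf(f_manual, m.df_model, m.df_resid)

    print(np.isclose(f_manual, m.fvalue))    # True
    print(np.isclose(p_manual, m.f_pvalue))  # True
    ```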

    And just to prove that this is the case:

    # Libraries
    import numpy as np
    import pandas as pd
    from statsmodels.api import OLS
    from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
    
    # Load the dataset once
    boston = load_boston()
    
    # Load target
    y = pd.DataFrame(boston['target'], columns=['price'])
    
    # Load features
    X = pd.DataFrame(boston['data'], columns=boston['feature_names'])
    
    # Add constant
    X['CONST'] = 1
    
    # One feature
    m1 = OLS(endog=y, exog=X[['CONST','CRIM']]).fit()
    print(f'm1 pvalue: {m1.f_pvalue}')
    
    # Multiple features
    m2 = OLS(endog=y, exog=X[['CONST','CRIM','AGE']]).fit()
    print(f'm2 pvalue: {m2.f_pvalue}')
    
    # Manually test H0: all slope coefficients are zero (excluding b0)
    # (one restriction row per slope; the constant column is left unrestricted)
    print('Manual F-test for m1', m1.f_test(r_matrix=np.array([[0, 1]])),
          'Manual F-test for m2', m2.f_test(r_matrix=np.array([[0, 1, 0], [0, 0, 1]])),
          sep='\n')
    
    # Output
    """
    > m1 pvalue: 1.1739870821944483e-19
    > m2 pvalue: 2.2015246345918656e-27
    > Manual F-test for m1
    > <F test: F=array([[89.48611476]]), p=1.1739870821945733e-19, df_denom=504, df_num=1>
    > Manual F-test for m2
    > <F test: F=array([[69.51929476]]), p=2.2015246345920063e-27, df_denom=503, df_num=2>
    """
    

    So yes, f_pvalue matches the p-value of an explicit F-test of the null hypothesis that all slope coefficients (excluding the constant) are zero.
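    Since load_boston has been removed from scikit-learn (as of version 1.2), here is a self-contained version of the same check using synthetic data; the generated regressors and coefficients are illustrative, not part of the original answer:

    ```python
    import numpy as np
    from statsmodels.api import OLS, add_constant

    # Synthetic data: columns of X are const, x1, x2
    rng = np.random.default_rng(42)
    X = add_constant(rng.normal(size=(200, 2)))
    y = X @ [0.5, 1.5, -2.0] + rng.normal(size=200)

    m = OLS(endog=y, exog=X).fit()

    # H0: b1 = b2 = 0 (the constant, column 0, is left unrestricted)
    R = np.array([[0, 1, 0],
                  [0, 0, 1]])
    ftest = m.f_test(r_matrix=R)

    # The manual restriction test reproduces the model's overall F-test
    print(np.isclose(ftest.pvalue, m.f_pvalue))  # True
    ```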