Tags: python, statsmodels, glm

Trying to get a Gaussian GLM to simply match OLS, but it is either very wrong or raises a perfect-separation error


I'm trying to make a GLM do what an OLS does, just to get a baseline understanding of GLM. But it doesn't seem to do what I'm after. Consider this code:

import numpy as np
import statsmodels.api as sm
from scipy import stats

print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
aInput = np.arange(10)
print(aInput)
aLinear = aInput.copy() * 0.25 + 5
print(aLinear)

print("### This is OLS to show clearly what we're after: ####################")
aInputConst = sm.add_constant(aInput)
model = sm.OLS(aLinear, aInputConst)
results = model.fit()
print(results.params)

print("This is GLM which looks nothing like what I expect: ##################")
model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)

print("This is GLM with the constant, but it just fails: ####################")
#                       vvvvvvvvvvv
model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)

Now consider the output:

### This is dummy data that clearly is y = 0.25 * x + 5: #############
[0 1 2 3 4 5 6 7 8 9]
[5.   5.25 5.5  5.75 6.   6.25 6.5  6.75 7.   7.25]
### This is OLS to show clearly what we're after: ####################
[5.   0.25]
This is GLM which looks nothing like what I expect: ##################
[0.         1.03947368 2.07894737 3.11842105 4.15789474 5.19736842
 6.23684211 7.27631579 8.31578947 9.35526316]
This is GLM with the constant, but it just fails: ####################
Traceback (most recent call last):
  File "min.py", line 26, in <module>
    result = model.fit()
  File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1065, in fit
    cov_kwds=cov_kwds, use_t=use_t, **kwargs)
  File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1211, in _fit_irls
    raise PerfectSeparationError(msg)
statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

Observe:

  • The data is contrived to be y = 0.25 x + 5 so that we know the "right" answer when we see it.
  • The OLS finds this naturally. It finds the 5 and it finds the 0.25. Easy enough.
  • But the first attempt with the GLM just goes from roughly 0 to 10, when I would expect it to go from about 5 to about 7.25, which is the range of the original outputs. Why doesn't it? Now, one thing I've read is that you need to add the constant, which leads to the next issue:
  • When I add the constant, I get the "perfect separation" error. If I'm supposed to add that constant, how can I add it without getting the error?

My question is:

  • How can I get the GLM to behave like an OLS and tell me 5 and 0.25? I'm just trying to get this as a starting baseline but can't. That would mean either getting the first GLM call to produce b0 and b1, along with the right range of outputs, or adding the constant, if I'm supposed to, and then getting b0 and b1 that way.

I'm clearly confused. All help appreciated!


Solution

  • Got it! Two things:

    • Yes, you do need that "sm.add_constant()" call. As noted on a number of sites, it's a quirk of both statsmodels and scikit-learn: if you want an intercept to be estimated, you need an explicit constant input column. Without it, the first GLM call fit a line through the origin, which is why its slope came out near 1.04 instead of 0.25. (See the short demo after this list.)

    • And the issue was that the dummy data was just too good. By adding some noise to the data so that it jitters around the regression line, it works fine. The reason: when the data fits the model exactly, the fitted means reproduce the response perfectly, and statsmodels' IRLS loop treats a perfectly predicted response as "perfect separation" (a check aimed mainly at binomial models), so it raises the error instead of returning the exact fit.
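
    To make the first point concrete, here is a minimal sketch (not part of the original code) of what "sm.add_constant()" actually does: it prepends a column of ones to the design matrix, and the coefficient fitted for that column becomes the intercept.

    import numpy as np
    import statsmodels.api as sm

    x = np.arange(5)
    print(sm.add_constant(x))
    # A column of ones is prepended so the model can estimate an intercept:
    # [[1. 0.]
    #  [1. 1.]
    #  [1. 2.]
    #  [1. 3.]
    #  [1. 4.]]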

    The working code with comments on the changes is:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    
    print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
    aInput = np.arange(10)
    print(aInput)
    noise = (np.random.rand(10) - 0.5) / 10 + 1 # New: multiplicative jitter in [0.95, 1.05].
    aLinear = aInput.copy() * 0.25 * noise + 5 # Changed: jitter the slope term.
    print(aLinear)
    
    print("### This is OLS to show clearly what we're after: ####################")
    aInputConst = sm.add_constant(aInput)
    model = sm.OLS(aLinear, aInputConst)
    results = model.fit()
    print(results.params)
    
    #   print("This is GLM which looks nothing like what I expect: ##################")
    #   model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
    #   result = model.fit()
    #   y_hat = result.predict(aInput)
    #   print(y_hat)
    
    print("This is GLM with the constant, but it just fails: ####################")
    #                       vvvvvvvvvvv
    model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
    result = model.fit()
    y_hat = result.predict(aInputConst) # Changed: use the "Const" version.
    print(y_hat)
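    print(result.params) # Added: the fitted coefficients, quoted below.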
    
    

    And the output is:

    ### This is dummy data that clearly is y = 0.25 * x + 5: #############
    [0 1 2 3 4 5 6 7 8 9]
    [5.         5.25190469 5.48057897 5.74306435 6.00161857 6.20957869
     6.43812791 6.71636422 7.09118653 7.25882849]
    ### This is OLS to show clearly what we're after: ####################
    [4.98249325 0.25258489]
    This is GLM with the constant, and now it works: ##################
    [4.98249325 5.23507814 5.48766302 5.74024791 5.9928328  6.24541768
     6.49800257 6.75058746 7.00317235 7.25575723]
    
    

    And as noted, the one thing I was after was seeing the coefficients that are contrived into the data. The result.params printed at the end of the code shows that, sure enough, they come out right:

    [4.98249325 0.25258489]
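
    As an extra sanity check (a small sketch reusing the results object from the OLS fit and the result object from the GLM fit above; it is not part of the original answer): a Gaussian-family GLM with its default identity link solves the same least-squares problem as OLS, so the two parameter vectors should agree to numerical precision.

    # Gaussian GLM with the identity link reproduces OLS, so this should print True.
    print(np.allclose(results.params, result.params))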
    

    So yes, you need the constant column. And if your test data is too clean, jitter it a bit so it's more like messy real-world data.
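
    One last aside beyond the original answer: statsmodels also has a formula interface that adds the intercept for you, so the add_constant() step disappears entirely. A minimal sketch (the DataFrame and column names here are illustrative, not from the code above):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Same kind of jittered data as above, packed into a DataFrame.
    df = pd.DataFrame({"x": np.arange(10)})
    df["y"] = df["x"] * 0.25 * ((np.random.rand(10) - 0.5) / 10 + 1) + 5

    # The formula "y ~ x" includes an intercept by default, so no add_constant().
    result = smf.glm("y ~ x", data=df, family=sm.families.Gaussian()).fit()
    print(result.params) # Intercept close to 5, x close to 0.25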