
statsmodels - plotting the fitted distribution


The following code fits an oversimplified generalized linear model using statsmodels:

import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()

This gives the intercept coefficient and its standard error:

               coef   std err
Intercept    2.9471     0.120

Now I want to graphically compare the actual distribution of the variable Y (as a histogram) with the distribution that comes from the model.

But I need the two parameters r and p to evaluate stats.nbinom(r, p) and plot it.

Is there a way to retrieve the parameters from the results of the fitting? How can I plot the PMF?


Solution

  • Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution. The Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.

    However, statsmodels also has NegativeBinomial as a maximum likelihood model in discrete_model, which estimates all of the parameters.
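
    If you would rather keep the formula interface used in the question, the same maximum likelihood model can be fit through the formula API. A minimal sketch I'm adding here, assuming the asker's df with a count column Y:

    import statsmodels.formula.api as smf

    res_ml = smf.negativebinomial('Y ~ 1', data=df).fit()
    print(res_ml.params)  # intercept (log of the mean) and the dispersion parameter alpha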

    The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. There are actually two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
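
    As a rough sketch of the mapping (my own helper, not part of statsmodels): nb2 assumes Var(Y) = mu + alpha*mu**2 while nb1 assumes Var(Y) = mu + alpha*mu, which leads to the conversion used in the script further below.

    def nbinom_params(mu, alpha, loglike_method='nb2'):
        # Convert the (mean, dispersion) parameterization used in count
        # regression to scipy.stats.nbinom's (n=size, p=prob).
        Q = 1 if loglike_method == 'nb1' else 0
        size = mu**Q / alpha           # nb1: mu/alpha, nb2: 1/alpha
        prob = size / (size + mu)      # so the scipy mean size*(1-prob)/prob equals mu
        return size, prob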

    Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all of its available methods: rvs, pmf, and so on.

    Something like this should be made available in statsmodels.

    In a few example runs, I got results like this

    data generating parameters 50 0.25
    estimated params           51.7167511571 0.256814610633
    estimated params           50.0985814878 0.249989725917
    

    As an aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
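
    For example, just to illustrate those fit options (the values here are arbitrary and not from the original answer; y and x are generated in the script below):

    res = sm.NegativeBinomial(y, x, loglike_method='nb1').fit(
        start_params=[0.1, 0.1], method='nm', maxiter=2000)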

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    
    # generate some data to check
    nobs = 1000
    n, p = 50, 0.25
    dist0 = stats.nbinom(n, p)
    y = dist0.rvs(size=nobs)
    x = np.ones(nobs)
    
    loglike_method = 'nb1'  # or use 'nb2'
    res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
    
    print(dist0.mean())
    print(res.params)
    
    mu = res.predict()   # use this for mean if not constant
    mu = np.exp(res.params[0])   # shortcut, we just regress on a constant
    alpha = res.params[1]
    
    if loglike_method == 'nb1':
        Q = 1
    elif loglike_method == 'nb2':    
        Q = 0
    
    size = 1. / alpha * mu**Q
    prob = size / (size + mu)
    
    print('data generating parameters', n, p)
    print('estimated params          ', size, prob)
    
    # estimated distribution
    dist_est = stats.nbinom(size, prob)
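
    To answer the plotting part of the question, here is a minimal sketch I'm adding (using matplotlib, not from the original answer) that overlays the PMF of dist_est on a normalized histogram of y; with the question's data you would use df['Y'] instead of y:

    import matplotlib.pyplot as plt

    k = np.arange(y.min(), y.max() + 1)
    plt.hist(y, bins=np.arange(y.min(), y.max() + 2) - 0.5, density=True,
             alpha=0.5, label='observed Y')
    plt.plot(k, dist_est.pmf(k), 'ro', label='fitted nbinom PMF')
    plt.legend()
    plt.show()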
    

    BTW: I ran into this before but didn't have time to look at it: https://github.com/statsmodels/statsmodels/issues/106