
statsmodels - plotting the fitted distribution


The following code fits an oversimplified generalized linear model using statsmodels:

import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()

This gives the intercept coefficient and its standard error:

               coef   std err
Intercept    2.9471     0.120

Now I want to graphically compare the actual distribution of the variable Y (as a histogram) with the distribution that comes from the model.

But I need the two parameters r and p to evaluate stats.nbinom(r, p) and plot it.

Is there a way to retrieve the parameters from the results of the fitting? How can I plot the PMF?


Solution

  • Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution. The Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.

    However, statsmodels also has NegativeBinomial as a maximum likelihood model in discrete_model, which estimates all of the parameters.
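
    If you would rather keep the formula interface used in the question, the same maximum likelihood model can be fit through the formula API. A minimal sketch I'm adding here, assuming the asker's df with a count column Y:

    import statsmodels.formula.api as smf

    res_ml = smf.negativebinomial('Y ~ 1', data=df).fit()
    print(res_ml.params)  # intercept (log of the mean) and the dispersion parameter alpha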

    The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. There are actually two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
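
    As a rough sketch of the mapping (my own helper, not part of statsmodels): nb2 assumes Var(Y) = mu + alpha*mu**2 while nb1 assumes Var(Y) = mu + alpha*mu, which leads to the conversion used in the script further below.

    def nbinom_params(mu, alpha, loglike_method='nb2'):
        # Convert the (mean, dispersion) parameterization used in count
        # regression to scipy.stats.nbinom's (n=size, p=prob).
        Q = 1 if loglike_method == 'nb1' else 0
        size = mu**Q / alpha           # nb1: mu/alpha, nb2: 1/alpha
        prob = size / (size + mu)      # so the scipy mean size*(1-prob)/prob equals mu
        return size, prob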

    Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all of its available methods: rvs, pmf, and so on.

    Something like this should be made available in statsmodels.

    In a few example runs, I got results like this

    data generating parameters 50 0.25
    estimated params           51.7167511571 0.256814610633
    estimated params           50.0985814878 0.249989725917
    

    As an aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
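
    For example, just to illustrate those fit options (the values here are arbitrary and not from the original answer; y and x are generated in the script below):

    res = sm.NegativeBinomial(y, x, loglike_method='nb1').fit(
        start_params=[0.1, 0.1], method='nm', maxiter=2000)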

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    
    # generate some data to check
    nobs = 1000
    n, p = 50, 0.25
    dist0 = stats.nbinom(n, p)
    y = dist0.rvs(size=nobs)
    x = np.ones(nobs)
    
    loglike_method = 'nb1'  # or use 'nb2'
    res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])
    
    print(dist0.mean())
    print(res.params)
    
    mu = res.predict()   # use this for mean if not constant
    mu = np.exp(res.params[0])   # shortcut, we just regress on a constant
    alpha = res.params[1]
    
    if loglike_method == 'nb1':
        Q = 1
    elif loglike_method == 'nb2':    
        Q = 0
    
    size = 1. / alpha * mu**Q
    prob = size / (size + mu)
    
    print('data generating parameters', n, p)
    print('estimated params          ', size, prob)
    
    # estimated distribution
    dist_est = stats.nbinom(size, prob)
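
    To answer the plotting part of the question, here is a minimal sketch I'm adding (using matplotlib, not from the original answer) that overlays the PMF of dist_est on a normalized histogram of y; with the question's data you would use df['Y'] instead of y:

    import matplotlib.pyplot as plt

    k = np.arange(y.min(), y.max() + 1)
    plt.hist(y, bins=np.arange(y.min(), y.max() + 2) - 0.5, density=True,
             alpha=0.5, label='observed Y')
    plt.plot(k, dist_est.pmf(k), 'ro', label='fitted nbinom PMF')
    plt.legend()
    plt.show()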
    

    BTW: I ran into this before but didn't have time to look at it: https://github.com/statsmodels/statsmodels/issues/106