Search code examples
pythonstatisticsbayesianpymcmcmc

Fitting a Binomial distribution with pymc raises ZeroProbability error for certain FillValues


I'm not sure if I found a bug in pymc. It seems like fitting a Binomial with missing data can produce a ZeroProbability error depending on the chosen fill_value that masks missing data. But maybe I'm using it wrongly. I tried the following example with the current master branch from github. I'm aware of the bug concerning Binomial distributions in pymc 2.3.4, but this seems to be a different issue.

I fitted a Binomial distribution with pymc and everything worked as I expected:

import scipy as sp
import pymc

def make_model(observed_values):
    p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
    values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values),\
                             value = observed_values, observed = True, plot = False)
    values = pymc.Binomial('values', n = 10, p = p,\
                             value = observed_values, observed = True, plot = False)
    return locals()

sp.random.seed(0)
observed_values = sp.random.binomial(n = 10.0, p = 0.1, size = 100)

M1 = pymc.MCMC(make_model(observed_values))
M1.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M1)
M1.summary()

Output:

  [-----------------100%-----------------] 10000 of 10000 complete in 0.7 sec
Plotting p

  p:

          Mean             SD               MC Error        95% HPD interval
          ------------------------------------------------------------------
          0.093            0.007            0.0              [ 0.081  0.107]


          Posterior quantiles:

          2.5             25              50              75             97.5
          |---------------|===============|===============|---------------|
          0.08             0.088           0.093          0.097         0.106

Now, I tried a very similar situation with the difference that one observed value would be missing:

mask = sp.zeros_like(observed_values)
mask[0] = True
masked_values = sp.ma.masked_array(observed_values, mask = mask, fill_value = 999999)

M2 = pymc.MCMC(make_model(masked_values))
M2.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M2)
M2.summary()

Unexpectedly, I got a ZeroProbability error:

---------------------------------------------------------------------------
ZeroProbability                           Traceback (most recent call last)
<ipython-input-16-4f945f269628> in <module>()
----> 1 M2 = pymc.MCMC(make_model(masked_values))
      2 M2.sample(iter=10000, burn=1000, thin=10)
      3 pymc.Matplot.plot(M2)
      4 M2.summary()

<ipython-input-12-cb8707bb911f> in make_model(observed_values)
      4 def make_model(observed_values):
      5     p = pymc.Uniform('p', lower = 0.0, upper = 1.0, value = 0.1)
----> 6     values = pymc.Binomial('values', n = 10* sp.ones_like(observed_values), p = p * sp.ones_like(observed_values),                             value = observed_values, observed = True, plot = False)
      7     values = pymc.Binomial('values', n = 10, p = p,                             value = observed_values, observed = True, plot = False)
      8     return locals()

/home/fabian/anaconda/lib/python2.7/site-packages/pymc/distributions.pyc in __init__(self, *args, **kwds)
    318                     logp_partial_gradients=logp_partial_gradients,
    319                     dtype=dtype,
--> 320                     **arg_dict_out)
    321 
    322     new_class.__name__ = name

/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
    773         if check_logp:
    774             # Check initial value
--> 775             if not isinstance(self.logp, float):
    776                 raise ValueError(
    777                     "Stochastic " +

/home/fabian/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.pyc in get_logp(self)
    930                     (self._value, self._parents.value))
    931             else:
--> 932                 raise ZeroProbability(self.errmsg)
    933 
    934         return logp

ZeroProbability: Stochastic values's value is outside its support,
or it forbids its parents' current values.

However, if I change the fill value in the masked array to 1, fitting works again:

masked_values2 = sp.ma.masked_array(observed_values, mask = mask, fill_value = 1)

M3 = pymc.MCMC(make_model(masked_values2))
M3.sample(iter=10000, burn=1000, thin=10)
pymc.Matplot.plot(M3)
M3.summary()

Output:

[-----------------100%-----------------] 10000 of 10000 complete in 2.1 sec
Plotting p

p:

        Mean             SD               MC Error        95% HPD interval
        ------------------------------------------------------------------
        0.092            0.007            0.0              [ 0.079  0.105]


        Posterior quantiles:

        2.5             25              50              75             97.5
        |---------------|===============|===============|---------------|
        0.079            0.088           0.092          0.097         0.105


values:

        Mean             SD               MC Error        95% HPD interval
        ------------------------------------------------------------------
        1.15             0.886            0.029                  [ 0.  3.]


        Posterior quantiles:

        2.5             25              50              75             97.5
        |---------------|===============|===============|---------------|
        0.0              1.0             1.0            2.0           3.0

Is this a bug or is there a problem with my model? Thanks for any help!


Solution

  • I asked the question the pymc developers on GitHub and there I got an answer from fonnesbeck (https://github.com/pymc-devs/pymc/issues/47#issuecomment-129301002):

    This is because you filled the missing values with 999999, which is outside the support of the variable. You need to give it a valid value, but not one that occurs in the non-missing data."

    To make sure that the missing value is not in the non-missing data the following was suggested:

    You can give it a non-integer value in order to avoid the problem that you cite. For example, if you fill with, say, 1.5 that should work.

    (...) PyMC calculates the log-probability at the first iteration, and therefore the values inserted for the missing values at the first iteration have to be valid. If you give discrete values a floating point value, it should end up getting truncated when converted to integers, so that will work."