Search code examples
bayesianpymc

Creating a three-level logistic regression model in pymc3


I'm attempting to create a three-level logistic regression model in pymc3. There is a top level, mid level, and an individual level, where the mid-level coefficients are estimated from top-level coefficients. I'm having difficulty specifying the proper data structure for the mid level, however.

Here's my code:

with pm.Model() as model:
    # Hyperpriors
    top_level_tau = pm.HalfNormal('top_level_tau', sd=100.)
    mid_level_tau = pm.HalfNormal('mid_level_tau', sd=100.)    

    # Priors
    top_level = pm.Normal('top_level', mu=0., tau=top_level_tau, shape=k_top)
    mid_level = [pm.Normal('mid_level_{}'.format(j),
                           mu=top_level[mid_to_top_idx[j]],
                           tau=mid_level_tau)
                 for j in range(k_mid)]

    intercept = pm.Normal('intercept', mu=0., sd=100.)

    # Model prediction
    yhat = pm.invlogit(mid_level[mid_to_bot_idx] + intercept)

    # Likelihood
    yact = pm.Bernoulli('yact', p=yhat, observed=y)

I'm getting the error "only integer arrays with one element can be converted to an index" (on line 16), which I think is related to the fact that the mid_level variable is a list, not a proper pymc Container. (I don't see the Container class in the pymc3 source code, either.)

Any help would be appreciated.

Edit: Adding some mock data

y = np.array([0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
mid_to_bot_idx = np.array([0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 2])
mid_to_top_idx = np.array([0, 0, 1, 1])
k_top = 2
k_mid = 4

Edit #2:

There seem to be a few different ways to solve this issue, although none are completely satisfactory:

1) One can reframe the model as:

with pm.Model() as model:
    # Hyperpriors
    top_level_tau = pm.HalfNormal('top_level_tau', sd=100.)
    mid_level_tau = pm.HalfNormal('mid_level_tau', sd=100.)    

    # Priors
    top_level = pm.Normal('top_level', mu=0., tau=top_level_tau, shape=k_top)
    mid_level = pm.Normal('mid_level', mu=0., tau=mid_level_tau, shape=k_top)
    intercept = pm.Normal('intercept', mu=0., sd=100.)

    # Model prediction
    yhat = pm.invlogit(top_level[top_to_bot_idx] + mid_level[mid_to_bot_idx] + intercept)

    # Likelihood
    yact = pm.Bernoulli('yact', p=yhat, observed=y)

This seems to work, although I can't figure out how to extend it to the case where the mid-level variance is not constant for all of the mid-level groups.

2) One can wrap the mid-level coefficients into a Theano tensor using theano.tensor.stack: i.e.,

import theano.tensor as tt
mid_level = tt.stack([pm.Normal('mid_level_{}'.format(j),
                           mu=top_level[mid_to_top_idx[j]],
                           tau=mid_level_tau)
                 for j in range(k_mid)])

But this seems to run very slowly on my actual data set (30k observations), and it makes plotting inconvenient (each of the mid_level coefficients gets its own trace using pm.traceplot).

Anyway, some advice/input from the developers would be appreciated.


Solution

  • Turns out the solution was simple: it appears that any distribution (like pm.Normal) can accept a vector of means as an argument, so replacing this line

    mid_level = [pm.Normal('mid_level_{}'.format(j),
                           mu=top_level[mid_to_top_idx[j]],
                           tau=mid_level_tau)
                 for j in range(k_mid)]
    

    with this

    mid_level = pm.Normal('mid_level',
                           mu=top_level[mid_to_top_idx],
                           tau=mid_level_tau,
                           shape=k_mid)
    

    works. The same method can also be used to specify individual standard deviations for each of the mid-level groups.

    EDIT: Fixed typo