Tags: python, bayesian, survival-analysis, mcmc, pymc3

MCMC convergence in hierarchical model with (large) time^2 term in pymc3


I have a hierarchical logit with observations over time. Following Carter 2010, I have included time, time^2, and time^3 terms. Before I add the time variables, the model mixes with Metropolis or NUTS (HamiltonianMC fails). NUTS and Metropolis still work with the linear time term, but both fail once I add time^2 and time^3, and they fail in different, puzzling ways. However, unlike in other models that fail for more obvious specification reasons, ADVI still gives an estimate (and the ELBO is not inf).

  • NUTS either stalls early (last time it made it to 60 iterations), or progresses too quickly and returns an empty traceplot with an error about varnames.
  • Metropolis errors out immediately with a dimension mismatch error. It looks like the one in this github issue, but I'm using a Bernoulli outcome, not a negative binomial. The end of the error looks like: ValueError: Input dimension mis-match. (input[0].shape[0] = 1, input[4].shape[0] = 18)
  • I get an empty trace when I try HamiltonianMC; it just returns the starting values with no mixing.
  • ADVI gives me a mean and a standard deviation.
  • I increased the ADVI iterations by a lot. It gave essentially the same starting points, and NUTS still failed.
  • I double-checked that the fix for the github issue is in place in the version of pymc3 I'm running. It is.

My intuition is that this has something to do with how huge the time^2 and time^3 variables get, since I'm looking over a large time-frame. Time^3 starts at 0 and goes to 64,000.
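
To make that concrete, here is a quick, made-up illustration (not my actual data) of the scales involved and of how centering and rescaling the time variable keeps things on a saner scale:

import numpy as np

# Made-up time index: raw t runs 0..40, so t^3 tops out at 64,000.
t = np.arange(0, 41)
t3 = t ** 3
print(t3.max())                  # 64000

# Even a tiny coefficient on raw t^3 dominates the log-odds...
print(0.01 * t3.max())           # 640 on the logit scale

# ...whereas a centered/standardized time variable stays on a unit-ish scale.
t_std = (t - t.mean()) / t.std()
print(t_std.min(), t_std.max())  # roughly -1.69 to 1.69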

Here is what I've tried for sampling so far. Note that I'm using small sample sizes while testing, since runs take so long (when they finish at all) and I'm just trying to get the model to sample. Once I find a setup that works, I'll increase the number of iterations.

with my_model:
    mu,sds,elbo = pm.variational.advi(n=500000,learning_rate=1e-1)
    print(mu['mu_b'])
    step = pm.NUTS(scaling=my_model.dict_to_array(sds)**2,
                   is_cov=True)
    my_trace = pm.sample(500, 
                         step=step, 
                         start=mu,
                         tune=100)

I've also done the above with tune=1000.

I've also tried Metropolis and HamiltonianMC:

with my_model:
    my_trace = pm.sample(5000,step=pm.Metropolis())

with my_model:
    my_trace = pm.sample(5000,step=pm.HamiltonianMC())

Questions:

  • Is my intuition about the size and spread of the time variables reasonable?
  • Are there ways to sample square and cubed terms more effectively? I've searched, but can you perhaps point me to a resource on this so I can learn more about it?
  • What's up with Metropolis and the dimension mismatch error?
  • What's up with the empty trace plots for NUTS? Usually when it stalls, the trace up until the stall works.
  • Are there alternative ways to handle time that might be easier to sample?

I haven't posted a toy model because it's hard to replicate the problem without the data; I'll add one once I can reproduce it with simulated data. The actual model is below:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as my_model:
    mu_b = pm.Flat('mu_b')
    sig_b = pm.HalfCauchy('sig_b',beta=2.5)
    b_raw = pm.Normal('b_raw',mu=0,sd=1,shape=n_groups)
    b = pm.Deterministic('b',mu_b + sig_b*b_raw)

    t1 = pm.Normal('t1',mu=0,sd=100**2,shape=1)
    t2 = pm.Normal('t2',mu=0,sd=100**2,shape=1)
    t3 = pm.Normal('t3',mu=0,sd=100**2,shape=1)

    est =(b[data.group.values]* data.x.values) +\
        (t1*data.t.values)+\
        (t2*data.t2.values)+\
        (t3*data.t3.values)

    y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed = data.y)

BREAKTHROUGH 1: Metropolis error

This turned out to be a weird syntax/shape issue: Theano seemed to be confused by a model with both constant and random effects. I created a constant column in data equal to 0, data['c'] = 0, and used it as an index for the time, time^2, and time^3 effects, as follows:

est =(b[data.group.values]* data.x.values) +\
            (t1[data.c.values]*data.t.values)+\
            (t2[data.c.values]*data.t2.values)+\
            (t3[data.c.values]*data.t3.values)
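
My reading of why this helps (my interpretation of the shapes, not something from the Theano docs): t1 has shape (1,), so t1[data.c.values], with data.c.values an array of N zeros, repeats the single coefficient once per observation and gives a length-N vector that lines up elementwise with data.t.values. A minimal version of the data prep:

import numpy as np

# One zero per row; used purely as an index so the length-1 coefficients
# broadcast to a vector with one entry per observation.
data['c'] = np.zeros(len(data), dtype=int)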

I don't think this is the whole issue, but it's a step in the right direction. I bet this is why my asymmetric specification didn't work, and if so, I suspect it may sample better.

UPDATE: It sampled! I will now try some of the suggestions for making things easier on the sampler, including the specification suggested here. But at least it's working!


Solution

  • Without having the dataset to play with, it is hard to give a definite answer, but here is my best guess:

    To me, it is a bit unexpected to hear about a third-order polynomial in there. I haven't read the paper, so I can't really comment on it, but I think this might be the reason for your problems. Even very small values of t3 will have a huge influence on the predictor. To keep this reasonable, I'd change the parametrization a bit: first make sure your predictor is centered (something like data['t'] = data['t'] - data['t'].mean()), and only after that define data.t2 and data.t3; the sketch at the end of this answer puts these pieces together. Then try to set a more reasonable prior on t2 and t3. They should be pretty small, so maybe try something like

    t1 = pm.Normal('t1',mu=0,sd=1,shape=1)
    t2 = pm.Normal('t2',mu=0,sd=1,shape=1)
    t2 = t2 / 100
    t3 = pm.Normal('t3',mu=0,sd=1,shape=1)
    t3 = t3 / 1000
    

    If you want to look at other models, you could try to model your predictor as a GaussianRandomWalk or a Gaussian Process (there is a rough GaussianRandomWalk sketch at the end of this answer).

    Updating pymc3 to the latest release candidate should also help; the sampler was improved quite a bit.

    Update: I just noticed you don't have an intercept term in your model. Unless there is a good reason for that, you probably want to add

    intercept = pm.Flat('intercept')
    est = (intercept
           + b[..] * data.x
           + ...)
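
    Putting these suggestions together, an untested sketch of a reparametrized model might look like the following. I'm reusing the column and variable names from your post (including the data['c'] index trick from your breakthrough) and guessing at the data prep, so treat it as a starting point rather than a drop-in replacement:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt

    # Center the time variable first, then build the polynomial terms from it.
    data['t'] = data['t'] - data['t'].mean()
    data['t2'] = data['t'] ** 2
    data['t3'] = data['t'] ** 3
    data['c'] = np.zeros(len(data), dtype=int)   # constant index column

    with pm.Model() as my_model:
        intercept = pm.Flat('intercept')

        mu_b = pm.Flat('mu_b')
        sig_b = pm.HalfCauchy('sig_b', beta=2.5)
        b_raw = pm.Normal('b_raw', mu=0, sd=1, shape=n_groups)
        b = pm.Deterministic('b', mu_b + sig_b * b_raw)

        # Much tighter priors than sd=100**2; the /100 and /1000 rescaling keeps
        # the higher-order coefficients small, as suggested above.
        t1 = pm.Normal('t1', mu=0, sd=1, shape=1)
        t2 = pm.Normal('t2', mu=0, sd=1, shape=1) / 100
        t3 = pm.Normal('t3', mu=0, sd=1, shape=1) / 1000

        est = (intercept
               + b[data.group.values] * data.x.values
               + t1[data.c.values] * data.t.values
               + t2[data.c.values] * data.t2.values
               + t3[data.c.values] * data.t3.values)

        y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed=data.y)

    And if you want to drop the polynomial entirely, here is a rough sketch of the GaussianRandomWalk alternative. data.period and n_periods are hypothetical: an integer time-period index per observation and the number of periods, which you would need to construct from your data:

    with pm.Model() as grw_model:
        mu_b = pm.Flat('mu_b')
        sig_b = pm.HalfCauchy('sig_b', beta=2.5)
        b_raw = pm.Normal('b_raw', mu=0, sd=1, shape=n_groups)
        b = pm.Deterministic('b', mu_b + sig_b * b_raw)

        # One random-walk value per time period replaces the t, t^2, t^3 terms;
        # the level of the walk also plays the role of the intercept here.
        sd_w = pm.HalfCauchy('sd_w', beta=1)
        w = pm.GaussianRandomWalk('w', sd=sd_w, shape=n_periods)

        est = (b[data.group.values] * data.x.values
               + w[data.period.values])

        y = pm.Bernoulli('y', p=tt.nnet.sigmoid(est), observed=data.y)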