Search code examples
linear-regressionpymc3

Basic Bayesian Linear Regression prediction with PyMC3


I would like to use my PyMC3 LR model to get an 80% HPD range for the value of the predicted variable y as new data becomes available. Thus, extrapolate a credible distribution of values for y for a new value of x not in my original dataset.

Model:

with pm.Model() as model_tlr:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10)
    epsilon = pm.Uniform('epsilon', 0, 25)

    nu = pm.Deterministic('nu', pm.Exponential('nu_', 1/29) + 1)
    mu = pm.Deterministic('mu', alpha + beta * x)

    yl = pm.StudentT('yl', mu=mu, sd=epsilon, nu=nu, observed=y)

    trace_tlr = pm.sample(50000, njobs=3)

After burnin I sample from the posterior and get an HPD

ppc_tlr = pm.sample_ppc(btrace_tlr, samples=10000, model=model_tlr)
ys = ppc_tlr['yl']
y_hpd = pm.stats.hpd(ys, alpha=0.2)

Which is great for visualizing the HPD around the central tendency (using fill_between) enter image description here

But I would like to now use the model to get an HPD for y when x=126.2 (for example) and the initial dataset didn't contain an observed x=126.2

The way I understand sampling from the posterior is that there are 10k samples for each of the available x values in the dataset and hence there isn't a corresponding sampling in ys for x=126.2 as it wasn't observed.

Basically, is there a way to use my model to obtain a distribution of credible values (based on the model) from a predictor value x=126.2 which only became available after the model was built? If so, how?

Thank you

EDIT:
Found SO Post which mentions

Function under development (will likely eventually get added to pymc3) that will allow to predict posteriors for new data.

Does this exist?


Solution

  • OK, so it's possible, more or less as described in the above SO post. However, there has since been a sample_ppc function added to PyMC3 which makes the author's run_ppc redundant.

    First, setup a Theano shared variable for x.

    from theano import shared
    x_shared = shared(x)
    

    Then use x_shared when building the model.

    After the model is built, add the new datum and update the shared variable

    x_updated = np.append(x, 126.2)
    x_shared.set_value(x_updated)
    

    Re-run the PPC sample generator with the original trace and model objects

    new_ppc = pm.sample_ppc(btrace_tlr, samples=10000, model=model_tlr)
    

    The sampling of the posterior for the new datum is found with

    sample = new_ppc['yl'][:,-1]
    

    I can then get the HPD with

    pm.stats.hpd(sample)
    

    array([ 124.56126638, 128.63795388])

    Sklearn has spoiled me into thinking there should be a simple predict interface...