I would like to use my PyMC3 LR model to get an 80% HPD range for the value of the predicted variable y
as new data becomes available.
Thus, extrapolate a credible distribution of values for y
for a new value of x
not in my original dataset.
Model:
with pm.Model() as model_tlr:
alpha = pm.Normal('alpha', mu=0, sd=10)
beta = pm.Normal('beta', mu=0, sd=10)
epsilon = pm.Uniform('epsilon', 0, 25)
nu = pm.Deterministic('nu', pm.Exponential('nu_', 1/29) + 1)
mu = pm.Deterministic('mu', alpha + beta * x)
yl = pm.StudentT('yl', mu=mu, sd=epsilon, nu=nu, observed=y)
trace_tlr = pm.sample(50000, njobs=3)
After burnin I sample from the posterior and get an HPD
ppc_tlr = pm.sample_ppc(btrace_tlr, samples=10000, model=model_tlr)
ys = ppc_tlr['yl']
y_hpd = pm.stats.hpd(ys, alpha=0.2)
Which is great for visualizing the HPD around the central tendency (using fill_between)
But I would like to now use the model to get an HPD for y
when x=126.2
(for example) and the initial dataset didn't contain an observed x=126.2
The way I understand sampling from the posterior is that there are 10k samples for each of the available x
values in the dataset and hence there isn't a corresponding sampling in ys
for x=126.2
as it wasn't observed.
Basically, is there a way to use my model to obtain a distribution of credible values (based on the model) from a predictor value x=126.2
which only became available after the model was built?
If so, how?
Thank you
EDIT:
Found SO Post which mentions
Function under development (will likely eventually get added to pymc3) that will allow to predict posteriors for new data.
Does this exist?
OK, so it's possible, more or less as described in the above SO post. However, there has since been a sample_ppc function added to PyMC3 which makes the author's run_ppc redundant.
First, setup a Theano shared variable for x.
from theano import shared
x_shared = shared(x)
Then use x_shared when building the model.
After the model is built, add the new datum and update the shared variable
x_updated = np.append(x, 126.2)
x_shared.set_value(x_updated)
Re-run the PPC sample generator with the original trace and model objects
new_ppc = pm.sample_ppc(btrace_tlr, samples=10000, model=model_tlr)
The sampling of the posterior for the new datum is found with
sample = new_ppc['yl'][:,-1]
I can then get the HPD with
pm.stats.hpd(sample)
array([ 124.56126638, 128.63795388])
Sklearn has spoiled me into thinking there should be a simple predict
interface...