I am new to PYMC3. Maybe this is a naive question, but I searched a lot and didn't find any explanation on this issue. Basically, I want to do a linear regression in PYMC3, however the training is very slow and the model performance on training set is very poor as well. Below is my code:
X_Tr = np.array([ 13.99802212, 13.8512075 , 13.9531636 , 13.97432944,
13.89211468, 13.91357953, 13.95987483, 13.86476587,
13.9501789 , 13.92698143, 13.9653932 , 14.06663115,
13.91697969, 13.99629862, 14.01392784, 13.96495713,
13.98697998, 13.97516973, 14.01048397, 14.05918188,
14.08342002, 13.89350606, 13.81768849, 13.94942447,
13.90465027, 13.93969029, 14.18771189, 14.08631113,
14.03718829, 14.01836206, 14.06758363, 14.05243539,
13.96287123, 13.93011351, 14.01616973, 14.01923812,
13.97424024, 13.9587175 , 13.85669845, 13.97778302,
14.04192138, 13.93775494, 13.86693585, 13.79985956,
13.82679677, 14.06474544, 13.90821822, 13.71648423,
13.78899668, 13.76857337, 13.87201756, 13.86152949,
13.80447525, 13.99609891, 14.0210165 , 13.986906 ,
13.97479211, 14.04562055, 14.03293095, 14.15178043,
14.32413197, 14.2330354 , 13.99247751, 13.92962912,
13.95394525, 13.87888254, 13.82743111, 14.10724699,
14.23638905, 14.15731881, 14.13239278, 14.13386722,
13.91442452, 14.01056255, 14.19378649, 14.22233852,
14.30405399, 14.25880108, 14.23985258, 14.21184303,
14.4443183 , 14.55710331, 14.42102092, 14.29047616,
14.43712609, 14.58666212])
y_tr = np.array([ 13.704, 13.763, 13.654, 13.677, 13.66 , 13.735, 13.845,
13.747, 13.747, 13.606, 13.819, 13.867, 13.817, 13.68 ,
13.823, 13.779, 13.814, 13.936, 13.956, 13.912, 13.982,
13.979, 13.919, 13.944, 14.094, 13.983, 13.887, 13.902,
13.899, 13.881, 13.784, 13.909, 13.99 , 14.06 , 13.834,
13.778, 13.703, 13.965, 14.02 , 13.992, 13.927, 14.009,
13.988, 14.022, 13.754, 13.837, 13.91 , 13.907, 13.867,
14.014, 13.952, 13.796, 13.92 , 14.051, 13.773, 13.837,
13.745, 14.034, 13.923, 14.041, 14.077, 14.125, 13.989,
14.174, 13.967, 13.952, 14.024, 14.171, 14.175, 14.091,
14.267, 14.22 , 14.071, 14.112, 14.174, 14.289, 14.146,
14.356, 14.5 , 14.265, 14.259, 14.406, 14.463, 14.473,
14.413, 14.507])
sns.regplot(x=X_tr, y=y_tr.flatten());
Here I use PYMC3 to train the model:
shA_X = shared(X_tr)
with pm.Model() as linear_model:
alpha = pm.Normal("alpha", mu=14,sd=100)
betas = pm.Normal("betas", mu=0, sd=100, shape=1)
sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
mu = alpha + betas * shA_X
forecast = pm.Normal("forecast", mu=mu, sd=sigma, observed=y_tr)
step = pm.NUTS()
trace=pm.sample(3000, tune=1000)
and then check the performance:
ppc_w = pm.sample_ppc(trace, 1000, linear_model,
progressbar=False)
plt.plot(ppc_w['forecast'].mean(axis=0),'r')
plt.plot(y_tr, color='k')`
Why is the prediction on the training set so poor? Any suggestions and ideas are appreciated.
This model is doing a fine job - I think the confusion is over how to handle the PyMC3
objects (thank you for the easy to work with example, though!). In general, PyMC3
will be used to quantify the uncertainty in your model.
For example, trace['betas'].mean()
is around 0.83 (this will depend on your random seed), while the least squares estimate that, for example, sklearn
gives will be 0.826. Similarly, trace['alpha'].mean()
gives 2.34, while the "true" value is 2.38.
You can also use the trace to plot many different plausible draws for your line of best fit:
for draw in trace[::100]:
pred = draw['betas'] * X_tr + draw['alpha']
plt.plot(X_tr, pred, '--', alpha=0.2, color='grey')
plt.plot(X_tr, y_tr, 'o');
Note that these are drawn from the distribution of "best fits" for your data. You also used sigma
to model the noise, and you can plot this value as well:
for draw in trace[::100]:
pred = draw['betas'] * X_tr + draw['alpha']
plt.plot(X_tr, pred, '-', alpha=0.2, color='grey')
plt.plot(X_tr, pred + draw['sigma'], '-', alpha=0.05, color='red')
plt.plot(X_tr, pred - draw['sigma'], '-', alpha=0.05, color='red');
plt.plot(X_tr, y_tr, 'o');
Using sample_ppc
draws observed values from your posterior distribution, so each row of ppc_w['forecast']
is a reasonable way for the data to be generated "next time". You might use that object this way:
ppc_w = pm.sample_ppc(trace, 1000, linear_model,
progressbar=False)
for draw in ppc_w['forecast'][::5]:
sns.regplot(X_tr, draw, scatter_kws={'alpha': 0.005, 'color': 'r'}, fit_reg=False)
sns.regplot(X_tr, y_tr, color='k', fit_reg=False);