In the tutorial on multiclass classification on the GPflow website, a Sparse Variational Gaussian Process (SVGP) is used on a 1D toy example. As is the case for all other GPflow models, the SVGP model has a method predict_y(self, Xnew), which returns the mean and variance of held-out data at the points Xnew.
From the tutorial it is clear that the first value unpacked from predict_y is the posterior predictive probability of each of the three classes (cells [7] and [8]), shown as the colored lines in the second panel of the plot below. However, the authors do not elaborate on the second value that can be unpacked from predict_y, which contains the variances of the predictions. In a regression setting its interpretation is clear to me, as the posterior predictive distribution in that case would be a Gaussian.
But I fail to understand how to interpret it here. In particular, I would like to know how this measure could be used to construct error bars denoting uncertainty around class predictions for any new data point.
I altered the code of the tutorial slightly to add an additional panel to the plot below: the third panel shows in black the maximal standard deviation (the square root of the variance obtained from predict_y). It clearly is a good measure for uncertainty, and it is probably also no coincidence that its highest possible value is 0.5, but I could not find how it is calculated and what it represents.
Complete notebook with all code here.
import numpy as np
import matplotlib.pyplot as plt

def plot(m):
    # m is the trained SVGP model from the tutorial
    f = plt.figure(figsize=(12, 8))
    a1 = f.add_axes([0.05, 0.05, 0.9, 0.5])   # latent GPs with +/- 2 std bands
    av = f.add_axes([0.05, 0.60, 0.9, 0.1])   # added panel: max predictive std from predict_y
    a2 = f.add_axes([0.05, 0.75, 0.9, 0.1])   # class probabilities from predict_y
    a3 = f.add_axes([0.05, 0.90, 0.9, 0.1])   # training points, coloured by class

    xx = np.linspace(m.X.read_value().min() - 0.3,
                     m.X.read_value().max() + 0.3, 200).reshape(-1, 1)
    mu, var = m.predict_f(xx)        # mean and variance of the latent functions
    mu, var = mu.copy(), var.copy()
    p, v = m.predict_y(xx)           # predictive mean (class probabilities) and variance

    a3.set_xticks([])
    a3.set_yticks([])
    av.set_xticks([])

    lty = ['-', '--', ':']
    for i in range(m.likelihood.num_classes):
        x = m.X.read_value()[m.Y.read_value().flatten() == i]
        points, = a3.plot(x, x * 0, '.')
        color = points.get_color()
        a1.fill_between(xx[:, 0], mu[:, i] + 2 * np.sqrt(var[:, i]),
                        mu[:, i] - 2 * np.sqrt(var[:, i]), alpha=0.2)
        a1.plot(xx, mu[:, i], color=color, lw=2)
        a2.plot(xx, p[:, i], '-', color=color, lw=2)

    # added panel: the largest per-class standard deviation of predict_y at each input
    av.plot(xx, np.sqrt(np.max(v[:, :], axis=1)), c="black", lw=2)

    for ax in [a1, av, a2, a3]:
        ax.set_xlim(xx.min(), xx.max())
    a2.set_ylim(-0.1, 1.1)
    a2.set_yticks([0, 1])
    a2.set_xticks([])

plot(m)
Model.predict_y() calls Likelihood.predict_mean_and_var(). If you look at the documentation of the latter function [1], you see that all it does is compute the mean and variance of the predictive distribution. That is, we first compute the marginal predictive distribution q(y) = \int p(y|f) q(f) df, and then we compute the mean and variance of q(y).
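To make that concrete, here is a minimal Monte Carlo sketch of that computation for a single Bernoulli output. GPflow itself uses quadrature or closed-form expressions rather than sampling, and the function name below is made up for illustration:

import numpy as np

def predictive_mean_and_var_mc(f_mean, f_var, n_samples=100000, seed=0):
    # Mean and variance of q(y) = int Bernoulli(y | sigmoid(f)) q(f) df,
    # with q(f) = N(f_mean, f_var), estimated by sampling f.
    rng = np.random.default_rng(seed)
    f = rng.normal(f_mean, np.sqrt(f_var), size=n_samples)  # draws from q(f)
    p_bar = (1.0 / (1.0 + np.exp(-f))).mean()               # E[y] = E_q(f)[sigmoid(f)]
    # y takes values 0/1, so E[y^2] = E[y] and Var[y] = p_bar - p_bar**2
    return p_bar, p_bar * (1.0 - p_bar)

mean, var = predictive_mean_and_var_mc(f_mean=0.3, f_var=1.2)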
For a Gaussian, the mean and variance can be specified independently of each other, and they have interpretations as a point prediction and the uncertainty. For a Bernoulli likelihood, the mean and variance are both completely determined by the single parameter p. The mean of the distribution is the probability of the event, which already tells us the uncertainty! The variance doesn't give much more.
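For y in {0, 1} this is explicit: since y^2 = y, we have Var[y] = E[y^2] - E[y]^2 = p - p^2 = p(1 - p), which is at most 1/4, so the standard deviation of q(y) is at most sqrt(1/4) = 0.5. This matches the maximum of 0.5 observed in the third panel above.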
However, you are right that the variance is a nice metric of uncertainty, where higher means more uncertainty. The entropy as a function of p looks very similar (although the two differ in behaviour near the edges):
import numpy as np
import matplotlib.pyplot as plt

# Bernoulli success probabilities (avoid 0 and 1, where log(p) is undefined)
p = np.linspace(0.001, 1 - 0.001, 1000)[:, None]
q = 1 - p
plt.plot(p, -p * np.log(p) - q * np.log(q), label='entropy')   # Bernoulli entropy
plt.plot(p, p * q, label='variance')                           # Bernoulli variance p(1-p)
plt.legend()
plt.xlabel('probability')
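Both curves vanish at p = 0 and p = 1 and peak at p = 0.5, where the variance is 0.25 and the entropy is log 2 (about 0.693 nats); near the edges the entropy goes to zero more slowly than the variance, which is the difference visible in the plot.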