As far as the SGPMC paper[1] goes, the pretraining should be pretty much identical to SVGP.
However, the implementations (current dev version) differ a bit, and I'm having some trouble understanding everything (especially what happens in the conditionals with q_sqrt=None) due to the dispatch programming style.
Do I see it correctly that the difference is that q_mu/q_var are now represented by the self.V normal distribution? And the only other change is that whitening is on by default, because it's required for the sampling?
The odd thing is that stochastic optimization (without any sampling yet) of SGPMC seems to work quite a bit better on my specific data than the SVGP class, which got me a bit confused, since they should basically be the same.
[1]Hensman, James, et al. "MCMC for variationally sparse Gaussian processes." Advances in Neural Information Processing Systems. 2015.
Edit2:
In the current dev branch I see that the (negative) training objective basically consists of VariationalExp + self.log_prior_density(), whereas the SVGP ELBO would be VariationalExp - KL(q(u) || p(u)). self.log_prior_density() apparently adds up all the prior densities.
So the training objective looks like equation (7) of the SGPMC paper (the whitened optimal variational distribution).
So by optimizing this optimal variational approximation to the posterior p(f*, f, u, θ | y), would we be getting a MAP estimate of the inducing points?
There are several elements to your question; I'll try to address them separately:
SVGP vs SGPMC objective:
In SVGP, we parametrize a closed-form posterior distribution q(u) by defining it as a normal (Gaussian) distribution with mean q_mu and covariance q_sqrt @ q_sqrt.T. In SGPMC, the distribution q(u) is implicitly represented by samples; V holds a single sample at a time.

In SVGP, the ELBO has a KL term that pulls q(u) towards the prior p(u) = N(0, Kuu). (With whitening, q_mu and q_sqrt parametrize q(v), the KL term drives them towards p(v) = N(0, I), and u = chol(Kuu) v.) In SGPMC, the same effect comes from the prior on V in the MCMC sampling. This is still reflected when doing MAP optimisation with a stochastic optimizer, but it is different from the KL term. You can set q_sqrt to zero and non-trainable in the SVGP model, but the two models still have slightly different objectives. Stochastic optimization of the SGPMC model might give you a better data fit, but this is not a variational optimization, so you might be overfitting to your training data.
training_loss: For all GPflow models, model.training_loss includes the log_prior_density. (It's just that by default the SVGP model parameters do not have any priors set.) The SGPMC training_loss() corresponds to the negative of eq. (7) in the SGPMC paper [1].
Inducing points: By default the inducing points Z do not have a prior, so it would just be maximum likelihood. Note that [1] suggests keeping Z fixed in the SGPMC model (and basing it on the optimised locations of a previously-fit SVGP model).
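To illustrate the "loss includes the log prior density" point on a toy one-parameter model (this is not GPflow code; the functions and numbers are made up for illustration): with a prior set, minimising the loss is MAP estimation; without one, the prior term is absent and you recover maximum likelihood, as with Z above.

```python
import numpy as np

y = np.array([0.8, 1.2, 1.1, 0.9])  # toy observations
theta = 1.0                          # a parameter with a N(0, 1) prior

def log_lik(theta, y, noise_var=0.1):
    # Gaussian log-likelihood of y with mean theta.
    n = y.size
    return (-0.5 * n * np.log(2 * np.pi * noise_var)
            - 0.5 * np.sum((y - theta) ** 2) / noise_var)

def log_prior(theta):
    # Standard-normal log prior density on theta.
    return -0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2

map_loss = -(log_lik(theta, y) + log_prior(theta))  # MAP-style objective
ml_loss = -log_lik(theta, y)                        # no prior -> ML objective

# The two objectives differ exactly by the log prior density.
assert np.isclose(map_loss - ml_loss, -log_prior(theta))
```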
What happens in conditional() with q_sqrt=None: conditional() computes the posterior distribution of f(Xnew) given the distribution of u. This handles both the case used in (S)VGP, where we have a variational distribution q(u) = N(q_mu, q_sqrt q_sqrt^T), and the noise-free case where "u is known", which is used in (S)GPMC. q_sqrt=None is equivalent to saying "the variance is zero", like a delta spike on the mean, but saves computation.
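A NumPy sketch of that equivalence, using the standard sparse-GP conditional formulas (toy random matrices instead of real kernel evaluations; the helper conditional() is mine, not GPflow's):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 2  # inducing points, test points

# A random SPD matrix partitioned into Kuu, Kuf, Knn (stand-ins for
# kernel evaluations at inducing inputs Z and test inputs Xnew).
B = rng.normal(size=(M + N, M + N))
K = B @ B.T + (M + N) * np.eye(M + N)
Kuu, Kuf, Knn = K[:M, :M], K[:M, M:], K[M:, M:]

m = rng.normal(size=M)            # mean of q(u)
A = np.linalg.solve(Kuu, Kuf).T   # Kfu Kuu^{-1}

def conditional(m, S):
    """Posterior of f(Xnew) given q(u) = N(m, S)."""
    mean = A @ m
    cov = Knn - A @ Kuf + A @ S @ A.T  # last term propagates q(u)'s variance
    return mean, cov

# With a full q_sqrt, the uncertainty in q(u) is propagated to f(Xnew).
q_sqrt = np.tril(rng.normal(size=(M, M)))
_, cov_q = conditional(m, q_sqrt @ q_sqrt.T)

# "q_sqrt=None" behaves like passing a zero covariance (a delta spike
# at m, "u is known"), just without computing the extra term at all.
_, cov_delta = conditional(m, np.zeros((M, M)))
assert np.allclose(cov_delta, Knn - A @ Kuf)
```

The dispatch on q_sqrt just picks between these two branches, which is why the SGPMC code path never needs a q_sqrt parameter.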