
GPflow multiple independent realizations of same GP, irregular sampling times/lengths


In GPflow I have multiple time series whose sampling times are not aligned, and the series may have different lengths (longitudinal data). I assume they are independent realizations of the same GP. What is the right way to handle this with SVGP, and more generally with GPflow? Do I need to use coregionalization? The coregionalization notebook assumes correlated trajectories, whereas I want a shared mean/kernel but independent realizations.


Solution

  • Yes, the Coregion kernel implemented in GPflow is what you can use for your problem.

    Let's set up some data from the generative model you describe, with different lengths for the time series:

    import numpy as np
    import gpflow
    import matplotlib.pyplot as plt
    
    Ns = [80, 90, 100]  # number of observations for three different realizations
    Xs = [np.random.uniform(0, 10, size=N) for N in Ns]  # observation locations
    
    # three different draws from the same GP:
    k = gpflow.kernels.Matern52(variance=2.0, lengthscales=0.5)  # kernel
    
    Ks = [k(X[:, None]).numpy() for X in Xs]  # Gram matrices, one per series
    Ls = [np.linalg.cholesky(K + 1e-9 * np.eye(len(K))) for K in Ks]  # small jitter for numerical stability
    vs = [np.random.randn(N, 1) for N in Ns]  # standard normal draws
    fs = [(L @ v).squeeze(axis=-1) for L, v in zip(Ls, vs)]  # samples via f = L v
    
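    Since matplotlib is already imported, a quick plot of the three draws is a useful sanity check (a minimal sketch; the styling is arbitrary):

    for i, (X, f) in enumerate(zip(Xs, fs)):
        plt.plot(X, f, ".", label=f"series {i}")
    plt.xlabel("t")
    plt.ylabel("f(t)")
    plt.legend()
    plt.show()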

    To actually set up the training data for the GPflow GP model:

    # output indicator for the observations: which timeseries is this?
    os = [o * np.ones(N) for o, N in enumerate(Ns)]  # [0 ... 0, 1 ... 1, 2 ... 2]
    
    # now assemble the three time series into a single data set:
    allX = np.concatenate(Xs)
    allo = np.concatenate(os)
    allf = np.concatenate(fs)
    X = np.c_[allX, allo]
    Y = allf[:, None]
    assert X.shape == (sum(Ns), 2)
    assert Y.shape == (sum(Ns), 1)
    
    # now let's set up a copy of the original kernel:
    k2 = gpflow.kernels.Matern52(active_dims=[0])  # same form as k above, but with hyperparameters to be learned
    
    # and a Coregionalization kernel that effectively says they are all independent:
    kc = gpflow.kernels.Coregion(output_dim=len(Ns), rank=1, active_dims=[1])
    kc.W.assign(np.zeros(kc.W.shape))
    kc.kappa.assign(np.ones(kc.kappa.shape))
    gpflow.set_trainable(kc, False)  # we want W and kappa fixed
    

    The Coregion kernel defines a covariance matrix B = W Wᵀ + diag(kappa). By setting W=0 we prescribe zero correlations between outputs (independent realizations), and kappa=1 (actually the default) ensures that the variance hyperparameter of the copy of the original kernel remains interpretable.
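    With W=0 and kappa=1, B is just the 3×3 identity; you can confirm this directly from the assigned parameters:

    W = kc.W.numpy()
    kappa = kc.kappa.numpy()
    B = W @ W.T + np.diag(kappa)
    print(B)  # identity matrix: unit variance per output, zero cross-correlations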

    Now construct the actual model and optimize hyperparameters:

    k2c = k2 * kc
    
    m = gpflow.models.GPR((X, Y), k2c, noise_variance=1e-5)  # the generated data is noise-free, so start with a small noise variance
    
    opt = gpflow.optimizers.Scipy()
    opt.minimize(m.training_loss, m.trainable_variables, compile=False)
    

    which recovers the initial variance and lengthscale hyperparameters pretty well.
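    You can check this with GPflow's built-in parameter summary:

    gpflow.utilities.print_summary(m)  # compare k2's variance/lengthscales with k's (2.0 and 0.5)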

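    The question also mentions SVGP: the product kernel k2c carries over unchanged; the inducing inputs just need the extra output-index column as well. Here is a minimal sketch (the inducing-point layout, a shared time grid replicated per output, is an arbitrary illustrative choice):

    Z_times = np.linspace(0, 10, 20)
    Z = np.vstack([np.c_[Z_times, o * np.ones_like(Z_times)] for o in range(len(Ns))])

    m_svgp = gpflow.models.SVGP(
        kernel=k2c,
        likelihood=gpflow.likelihoods.Gaussian(),
        inducing_variable=Z,
    )
    opt = gpflow.optimizers.Scipy()
    opt.minimize(m_svgp.training_loss_closure((X, Y)), m_svgp.trainable_variables)

    For a dataset this small, GPR is preferable; SVGP pays off once you have enough observations to need minibatching with a stochastic optimizer.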
    If you want to predict, you have to provide the extra "output" column in the Xnew argument to m.predict_f(), e.g. as follows:

    Xtest = np.linspace(0, 10, 100)
    Xtest_augmented = np.c_[Xtest, np.zeros_like(Xtest)]
    f_mean, f_var = m.predict_f(Xtest_augmented)
    

    (Note that because W=0 makes the outputs independent, the output column selects which time series you condition on: the zeros above give the posterior for the first series, informed only by its own observations. A new, entirely unobserved realization would simply have the prior distribution.)
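    To visualize the fit for the first series (a sketch reusing the variables defined above):

    mu = f_mean.numpy().squeeze(axis=-1)
    sd = np.sqrt(f_var.numpy().squeeze(axis=-1))
    plt.plot(Xs[0], fs[0], "k.", label="series 0 observations")
    plt.plot(Xtest, mu, label="posterior mean")
    plt.fill_between(Xtest, mu - 2 * sd, mu + 2 * sd, alpha=0.3, label="±2 std")
    plt.legend()
    plt.show()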

    If your input was more than one-dimensional, you could set active_dims=list(range(X.shape[1] - 1)) for the first kernel(s) and active_dims=[X.shape[1]-1] for the Coregion kernel.
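    For example, with a hypothetical two-dimensional input (so the augmented X has three columns: two inputs plus the output index), the kernel construction would become:

    D = 2  # hypothetical input dimensionality
    k2 = gpflow.kernels.Matern52(active_dims=list(range(D)))  # acts on the input columns
    kc = gpflow.kernels.Coregion(output_dim=len(Ns), rank=1, active_dims=[D])  # acts on the output-index column
    k2c = k2 * kc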