
How to create a custom Kernel for a Gaussian Process Regressor in scikit-learn?


I am looking into using a GPR in a rather peculiar context, where I need to write my own kernel. However, I found there is no documentation on how to do this. Simply inheriting from Kernel and implementing the methods __call__, get_params, diag and is_stationary is enough to get the fitting process to work, but it breaks down when I try to predict y values and standard deviations. What are the necessary steps to build a minimal but functional class that inherits from Kernel while using its own kernel function? Thanks!


Solution

  • Depending on how exotic your kernel will be, the answer to your question may be different.

    I find the implementation of the RBF kernel quite self-documenting, so I use it as a reference. Here is the gist:

    from sklearn.gaussian_process.kernels import (
        Kernel, StationaryKernelMixin, NormalizedKernelMixin, Hyperparameter)

    class RBF(StationaryKernelMixin, NormalizedKernelMixin, Kernel):
        def __init__(self, length_scale=1.0, length_scale_bounds=(1e-5, 1e5)):
            self.length_scale = length_scale
            self.length_scale_bounds = length_scale_bounds

        @property
        def hyperparameter_length_scale(self):
            # `anisotropic` is a further property of the real RBF class
            # (True when length_scale holds one value per feature)
            if self.anisotropic:
                return Hyperparameter("length_scale", "numeric",
                                      self.length_scale_bounds,
                                      len(self.length_scale))
            return Hyperparameter(
                "length_scale", "numeric", self.length_scale_bounds)

        def __call__(self, X, Y=None, eval_gradient=False):
            # ...

    As you mentioned, your kernel should inherit from Kernel, which requires you to implement __call__, diag and is_stationary. Note that sklearn.gaussian_process.kernels provides StationaryKernelMixin and NormalizedKernelMixin, which implement diag and is_stationary for you (cf. the RBF class definition in the code above).
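
    For reference, the two mixins are tiny; simplified from the sklearn source, they provide roughly the following:

    import numpy as np

    class StationaryKernelMixin:
        def is_stationary(self):
            # a stationary kernel depends only on x - y
            return True

    class NormalizedKernelMixin:
        def diag(self, X):
            # a normalized kernel has k(x, x) == 1
            return np.ones(X.shape[0])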

    You should not override get_params! That is done for you by the Kernel class, which expects scikit-learn kernels to follow a convention that your kernel should follow as well: specify your parameters in the signature of your constructor as keyword arguments (see length_scale in the previous example of the RBF kernel). This ensures that your kernel can be copied, which GaussianProcessRegressor.fit(...) does (and this could be the reason why you could not predict the standard deviation).
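
    To see this convention in action, here is a small check (using the built-in RBF kernel, but the same holds for your own class):

    from sklearn.base import clone
    from sklearn.gaussian_process.kernels import RBF

    k = RBF(length_scale=2.0)
    # get_params() is inherited from Kernel and reads the constructor signature
    print(k.get_params())
    # -> {'length_scale': 2.0, 'length_scale_bounds': (1e-05, 100000.0)}
    k_copy = clone(k)  # GaussianProcessRegressor.fit() copies the kernel like this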

    At this point, you may notice the other constructor argument, length_scale_bounds. It is only a constraint on the actual hyperparameter length_scale (cf. constrained optimization). This brings us to the fact that you also need to declare the hyperparameters that you want optimized and for which you need to compute gradients in your __call__ implementation. You do that by defining a property of your class whose name is prefixed with hyperparameter_ (cf. hyperparameter_length_scale in the code). Each hyperparameter that is not fixed (i.e. whose hyperparameter.fixed is not True) is returned by Kernel.theta, which the GP uses during fit() and to compute the marginal log likelihood. So this is essential if you want to fit parameters to your data.
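
    Putting these pieces together, a minimal working custom kernel could look like the following sketch (an RBF-like kernel with a single length_scale; the class name MyRBF and the particular formula are just for illustration):

    import numpy as np
    from scipy.spatial.distance import cdist, pdist, squareform
    from sklearn.gaussian_process.kernels import (
        Kernel, StationaryKernelMixin, NormalizedKernelMixin, Hyperparameter)

    class MyRBF(StationaryKernelMixin, NormalizedKernelMixin, Kernel):
        def __init__(self, length_scale=1.0, length_scale_bounds=(1e-5, 1e5)):
            # parameters as constructor keyword arguments, so that
            # get_params() and cloning work out of the box
            self.length_scale = length_scale
            self.length_scale_bounds = length_scale_bounds

        @property
        def hyperparameter_length_scale(self):
            # the hyperparameter_ prefix exposes length_scale to Kernel.theta
            return Hyperparameter("length_scale", "numeric",
                                  self.length_scale_bounds)

        def __call__(self, X, Y=None, eval_gradient=False):
            X = np.atleast_2d(X)
            if Y is None:
                dists = squareform(pdist(X, metric="sqeuclidean"))
            else:
                if eval_gradient:
                    raise ValueError(
                        "Gradient can only be evaluated when Y is None.")
                dists = cdist(X, Y, metric="sqeuclidean")
            K = np.exp(-0.5 * dists / self.length_scale ** 2)
            if eval_gradient:
                # gradient w.r.t. log(length_scale), shape (n, n, n_hyperparameters)
                K_gradient = (K * dists / self.length_scale ** 2)[:, :, np.newaxis]
                return K, K_gradient
            return K

    With this in place, GaussianProcessRegressor(kernel=MyRBF()).fit(X, y) should optimize length_scale, and predict(X_test, return_std=True) should return standard deviations again.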

    One last detail about Kernel.theta: the implementation states:

    Returns the (flattened, log-transformed) non-fixed hyperparameters.

    So you should be careful with 0 values in your hyperparameters: because of the log transform they can end up as -inf or np.nan and break things.
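
    A quick way to see the log transform (continuing with the MyRBF sketch from above):

    k = MyRBF(length_scale=1.0)
    print(k.theta)           # array([0.])  -- log(1.0), i.e. log-transformed
    print(np.exp(k.theta))   # array([1.])  -- back-transformed to the original value
    # a hyperparameter value of 0 would give log(0) = -inf here,
    # which tends to poison the optimization with inf/nan values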

    I hope this helps, even though this question is already a bit old. I have actually never implemented a kernel myself, but I was eager to skim the sklearn code base. It is unfortunate that there are no official tutorials on this; the code base, however, is quite clean and well commented.