Tags: python, data-science, gpy

Most significant input dimensions for GPy.GPCoregionalizedRegression?


I have successfully trained a multi-output Gaussian Process model using a GPy.models.GPCoregionalizedRegression model from the GPy package. The model has ~25 inputs and 6 outputs.

The underlying kernel is a GPy.util.multioutput.ICM kernel consisting of a RationalQuadratic kernel (GPy.kern.RatQuad) and the GPy.kern.Coregionalize kernel.

I am now interested in the feature importance for each individual output. The RatQuad kernel provides an ARD=True (Automatic Relevance Determination) keyword, which allows one to obtain the feature importance in a single-output model (this is also what the get_most_significant_input_dimensions() method of a GPy model exploits).

However, calling the get_most_significant_input_dimensions() method on the GPy.models.GPCoregionalizedRegression model gives me a single list of indices, which I assume are the most significant inputs for all outputs combined.

How can I calculate/obtain the lengthscale values or most significant features for each individual output of the model?


Solution

  • The problem is the model itself. The intrinsic coregionalization model (ICM) is set up such that all outputs are driven by a single shared ("latent") Gaussian Process. Thus, calling get_most_significant_input_dimensions() on a GPy.models.GPCoregionalizedRegression model can only give you one set of input dimensions that is significant for all outputs together.
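    To make the shared-latent-GP point concrete: the ICM covariance factorizes as K((x, i), (x', j)) = B[i, j] * k(x, x'), i.e. one input kernel k (with one set of ARD lengthscales) mixed across outputs by a coregionalization matrix B. A minimal NumPy sketch of this structure, with a toy kernel matrix and a made-up rank-1 B (illustration only, not GPy code):

```python
import numpy as np

# Toy Gram matrix of a single shared (latent) kernel k over 4 inputs.
# All outputs reuse this one matrix, hence one set of lengthscales.
n_points, n_outputs = 4, 2
diffs = np.subtract.outer(np.arange(n_points), np.arange(n_points))
Kx = np.exp(-0.5 * diffs.astype(float) ** 2)

# Coregionalization matrix B = W W^T + diag(kappa), here rank 1
# with made-up values.
W = np.array([[1.0], [0.5]])
B = W @ W.T + np.diag([0.1, 0.1])

# Full ICM covariance over all (input, output) pairs.
K_icm = np.kron(B, Kx)
print(K_icm.shape)  # (8, 8)
```

    Because only B differs between outputs, no per-output lengthscales exist to be ranked.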

    The solution is to use a GPy.util.multioutput.LCM kernel, which is defined as a sum of ICM kernels built from a list of individual (latent) GP kernels. It works as follows:

    import GPy
    
    # Your data
    # x = ...  # shape (n_samples, n_features)
    # y = ...  # shape (n_samples, n_outputs)
    
    # # ICM case (one shared latent GP for all outputs)
    # kernel = GPy.util.multioutput.ICM(input_dim=x.shape[1],
    #                                   num_outputs=y.shape[1],
    #                                   kernel=GPy.kern.RatQuad(input_dim=x.shape[1], ARD=True))
    
    # LCM case (one latent GP kernel per output)
    rank = 1  # rank of the coregionalization matrix W
    k_list = [GPy.kern.RatQuad(input_dim=x.shape[1], ARD=True) for _ in range(y.shape[1])]
    kernel = GPy.util.multioutput.LCM(input_dim=x.shape[1], num_outputs=y.shape[1],
                                      W_rank=rank, kernels_list=k_list)
    

    A reshaping of the data is needed (this is also necessary for the ICM model and is thus independent of the scope of this question; see here for details):

    # Reshaping data to fit GPCoregionalizedRegression
    # (reshape_for_coregionalized_regression is a placeholder for the
    # reshaping described in the link above)
    xx = reshape_for_coregionalized_regression(x)
    yy = reshape_for_coregionalized_regression(y)
    
    m = GPy.models.GPCoregionalizedRegression(xx, yy, kernel=kernel)
    m.optimize()
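    The reshaping referenced above amounts to stacking the per-output data blocks and appending a column holding the output index (GPy also provides GPy.util.multioutput.build_XY for this when X and Y are given as lists). A NumPy sketch with made-up toy shapes:

```python
import numpy as np

# Toy data: 5 samples, 3 input features, 2 outputs (made-up shapes).
n, d, n_out = 5, 3, 2
x = np.random.default_rng(0).normal(size=(n, d))
y = np.random.default_rng(1).normal(size=(n, n_out))

# Stack the inputs once per output and tag each block with its
# output index in an extra last column.
xx = np.vstack([np.hstack([x, np.full((n, 1), i)]) for i in range(n_out)])

# Stack the matching output columns into a single column vector.
yy = np.vstack([y[:, [i]] for i in range(n_out)])

print(xx.shape, yy.shape)  # (10, 4) (10, 1)
```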
    

    After the optimization has converged, one can call get_most_significant_input_dimensions() on an individual latent GP kernel (here for output 0):

    sig_inputs_0 = m.sum.ICM0.get_most_significant_input_dimensions()
    

    or loop over all kernels:

    sig_inputs = []
    for part in m.kern.parts:
        sig_inputs.append(part.get_most_significant_input_dimensions())
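
    Under ARD, this ranking is driven by the fitted lengthscales: the smaller a dimension's lengthscale, the faster that latent GP varies along it and the more relevant the input is. A NumPy-only sketch of that relationship, using made-up lengthscale values rather than a trained model:

```python
import numpy as np

# Hypothetical ARD lengthscales of one latent kernel after training,
# one value per input dimension (made-up numbers).
lengthscales = np.array([5.0, 0.3, 12.0, 1.1])

# A smaller lengthscale means higher relevance, so rank dimensions
# by inverse lengthscale, most significant first.
ranking = np.argsort(1.0 / lengthscales)[::-1]
print(ranking)  # [1 3 0 2]
```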