Search code examples
python-2.7machine-learningscikit-learnsvmlibsvm

User defined SVM kernel with scikit-learn


I encounter a problem when defining a kernel by myself in scikit-learn. I define by myself the gaussian kernel and was able to fit the SVM but not to use it to make a prediction.

More precisely I have the following code

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.utils import shuffle
import scipy.sparse as sparse
import numpy as np


digits = load_digits(2)
X, y = shuffle(digits.data, digits.target)

gamma = 1.0


X_train, X_test = X[:100, :], X[100:, :]
y_train, y_test = y[:100], y[100:]

m1 = SVC(kernel='rbf',gamma=1)
m1.fit(X_train, y_train)
m1.predict(X_test)

def my_kernel(x,y):
    d = x - y
    c = np.dot(d,d.T)
    return np.exp(-gamma*c)

m2 = SVC(kernel=my_kernel)
m2.fit(X_train, y_train)
m2.predict(X_test)

m1 and m2 should be the same, but m2.predict(X_test) return the error :

operands could not be broadcast together with shapes (260,64) (100,64)

I don't understand the problem.

Furthermore if x is one data point, the m1.predict(x) gives a +1/-1 result, as expexcted, but m2.predict(x) gives an array of +1/-1... No idea why.


Solution

  • The error is at the x - y line. You cannot subtract the two like that, because the first dimensions of both may not be equal. Here is how the rbf kernel is implemented in scikit-learn, taken from here (only keeping the essentials):

    def row_norms(X, squared=False):
    
        if issparse(X):
            norms = csr_row_norms(X)
        else:
            norms = np.einsum('ij,ij->i', X, X)
    
        if not squared:
            np.sqrt(norms, norms)
        return norms
    
    def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
       """
        Considering the rows of X (and Y=X) as vectors, compute the
        distance matrix between each pair of vectors.
    
        [...]
    
    
        Returns
        -------
        distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
       """
        X, Y = check_pairwise_arrays(X, Y)
    
        if Y_norm_squared is not None:
            YY = check_array(Y_norm_squared)
            if YY.shape != (1, Y.shape[0]):
                raise ValueError(
                    "Incompatible dimensions for Y and Y_norm_squared")
        else:
            YY = row_norms(Y, squared=True)[np.newaxis, :]
    
        if X is Y:  # shortcut in the common case euclidean_distances(X, X)
            XX = YY.T
        else:
            XX = row_norms(X, squared=True)[:, np.newaxis]
    
        distances = safe_sparse_dot(X, Y.T, dense_output=True)
        distances *= -2
        distances += XX
        distances += YY
        np.maximum(distances, 0, out=distances)
    
        if X is Y:
            # Ensure that distances between vectors and themselves are set to 0.0.
            # This may not be the case due to floating point rounding errors.
            distances.flat[::distances.shape[0] + 1] = 0.0
    
        return distances if squared else np.sqrt(distances, out=distances)
    
    def rbf_kernel(X, Y=None, gamma=None):
    
        X, Y = check_pairwise_arrays(X, Y)
        if gamma is None:
            gamma = 1.0 / X.shape[1]
    
        K = euclidean_distances(X, Y, squared=True)
        K *= -gamma
        np.exp(K, K)    # exponentiate K in-place
        return K
    

    You might want to dig deeper into the code, but look at the comments for the euclidean_distances function. A naive implementation of what you're trying to achieve would be this:

    def my_kernel(x,y):
        d = np.zeros((x.shape[0], y.shape[0]))
        for i, row_x in enumerate(x):
            for j, row_y in enumerate(y):
                d[i, j] = np.exp(-gamma * np.linalg.norm(row_x - row_y))
    
        return d