python, arrays, numpy, matrix, vectorization

Fastest way to calculate the similarity kernel between matrices of strings?


Say I have a list of 5 matrices called data. Each matrix has an arbitrary number of rows but exactly 3 columns, each holding a string. I want to train a Gaussian Process model, and let's say data is my training set. I want to calculate the similarity kernel based on string matching between every pair of matrices. Say the 5 matrices look like this:

import numpy as np

data = np.array([
    # Matrix 0
    [['b', 'a', 'c'],
     ['a', 'a', 'b'],
     ['d', 'c', 'c'],
     ['a', 'b', 'd']],
    # Matrix 1
    [['d', 'a', 'c'],
     ['a', 'b', 'c'],
     ['a', 'd', 'd']],
    # Matrix 2
    [['a', 'b', 'b'],
     ['d', 'a', 'd'],
     ['d', 'b', 'a'],
     ['c', 'b', 'd']],
    # Matrix 3
    [['b', 'b', 'c'],
     ['a', 'b', 'b'],
     ['a', 'c', 'a'],
     ['c', 'b', 'a'],
     ['b', 'd', 'd']],
    # Matrix 4
    [['a', 'b', 'c'],
     ['c', 'a', 'b'],
     ['d', 'd', 'c'],
     ['a', 'a', 'a']]
], dtype=object)

I want to calculate the similarity between every two matrices. Take the first two matrices as an example. They have 4 and 3 rows, respectively, so I check the string matching of all 4 x 3 pairs of rows. For each pair of rows, the difference between the strings in matching positions (only 0-0, 1-1, 2-2 are compared) is 0 if they are the same and 1 otherwise. This yields a binary vector diff, which is then fed into a squared exponential (RBF) kernel to get the similarity score for that pair of rows. I then compute the similarity scores of all pairs of rows (0-0, 0-1, 0-2, 1-0, ..., 3-0, 3-1, 3-2) and sum them; this sum is the final similarity between the first and second matrices. Doing this for all pairs of matrices gives the final similarity kernel R and its normalized version K.
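For illustration, here is a minimal sketch of this per-row step (assuming sigma = 1), using the first row of Matrix 0 and the first row of Matrix 1:

import math
import numpy as np

row_a = np.array(['b', 'a', 'c'], dtype=object)  # Matrix 0, row 0
row_b = np.array(['d', 'a', 'c'], dtype=object)  # Matrix 1, row 0

# 0 where the strings match, 1 where they differ
diff = (row_a != row_b).astype(int)               # [1 0 0]

# Squared exponential (RBF) kernel on the binary diff vector
k = math.exp(-np.dot(diff, diff) / (2 * 1.0**2))
print(diff, k)                                    # [1 0 0] 0.6065...

Below is my implementation: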

from itertools import product
import math

def kernel(data, sigma=1.):
    # Initialize the similarity kernel
    R = np.zeros(shape=(len(data), len(data)))

    # Get every pair of matrices (including themselves)
    for iprod in list(product(enumerate(data), enumerate(data))):
        idxs, prod = zip(*[(i, c) for i, c in iprod])
        ks = []

        # Get every pair of rows between the two matrices
        for pair in list(product(*prod)):
            diff = (np.asarray(pair[0]) != np.asarray(pair[1])).astype(int)

            # Squared exponential kernel
            k = math.exp(-np.dot(diff, diff) / (2*sigma**2))
            ks.append(k)

        # Calculate sum and insert it into R[i,j] and R[j,i]
        ktot = np.sum(ks)
        R[idxs[0],idxs[1]] = ktot
        R[idxs[1],idxs[0]] = ktot

    # Normalize the similarity matrix R
    d = np.diag(R)**-0.5
    K = np.diag(d).dot(R).dot(np.diag(d))

    return K

K = kernel(data)
print(K)

Output:

[[1.         0.81009275 0.71374617 0.7365101  0.81061262]
 [0.81009275 1.         0.68228781 0.70349301 0.82009247]
 [0.71374617 0.68228781 1.         0.78137859 0.68976163]
 [0.7365101  0.70349301 0.78137859 1.         0.7365101 ]
 [0.81061262 0.82009247 0.68976163 0.7365101  1.        ]]

%timeit kernel(data)
2.86 ms ± 64.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, my code becomes very slow when there is much more training data or when each matrix has many more rows. My speculation is that this is because I rely so heavily on itertools.product and there are no matrix operations. I feel like there should be a fully vectorized NumPy way to do this. I am thinking of np.cov and np.kron but don't know how they could work for strings. Any suggestion is appreciated.


Solution

  • You can get about one order of magnitude faster for the case of the list provided in your example if you replace the inner loop with vectorized calls. I do not know how to vectorize the outer loop, as data consists of differently shaped arrays... Anyways, in the following snippet we use broadcasting to operate between arrays in a pairwise fashion, as well as einsum for extra coolness ;). The improvement is not as much as that obtained by @Jérôme Richard, but it only uses NumPy! A quick sanity check against the original implementation is shown after the code.

    def kernel_oneLoop(data, sigma = 1.):
        # Convert matrices to arrays first
        data = np.array([np.asarray(d, dtype=object) for d in data], dtype=object)
        # Initialize the similarity kernel
        R = np.zeros(shape=(data.size, data.size))
        # Iterate through upper triangular matrix indices
        idx0, idx1 = np.triu_indices_from(R)
        for i in range(idx0.size):
            diff = (data[idx0[i]] != data[idx1[i]][:,None]).astype(int)
            # Squared exponential kernel (no need to square as they're 0's and 1's)
            k = np.exp(-diff.sum(axis=-1) / (2*sigma**2))
            # Calculate sum and insert it into R[i,j] and R[j,i]
            R[idx0[i],idx1[i]] = k.sum()
        # Normalize the similarity matrix R
        d = np.diag(R)**-0.5
        K = np.einsum("i,ij,j->ij", d, R, d)
        # Symmetrize
        K = K + K.T - np.diag(np.diag(K))
        
        return K
    

    Edit: as pointed out by Jérôme, since the output is a symmetric array, we can loop over only the upper triangular indices. I updated the code accordingly.
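    As a quick sanity check (a minimal sketch, assuming data and the original kernel from the question are still in scope), the broadcasted comparison produces all row-pair differences in one shot, and both implementations should return the same kernel:

        A = np.asarray(data[0], dtype=object)    # 4 x 3
        B = np.asarray(data[1], dtype=object)    # 3 x 3
        diff = (A != B[:, None]).astype(int)     # broadcasts to shape (3, 4, 3)
        print(diff.sum(axis=-1).shape)           # (3, 4): one mismatch count per row pair

        print(np.allclose(kernel(data), kernel_oneLoop(data)))   # expected: True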