
Efficient computation of pairwise distances between observations with mixed numeric and categorical data


I am working on a data science project in which I have to compute the Euclidean distance between every pair of observations in a dataset.

Since I am working with very large datasets, I need an implementation of the pairwise distance computation that is efficient in both memory usage and computation time.

One solution is to use the pdist function from Scipy, which returns the distances in a condensed 1D array, without duplicate pairs.
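To illustrate the condensed output (a small example, not part of my project code): pdist returns the pairs (0,1), (0,2), ..., (1,2), ... in row-major order of the upper triangle, and squareform can expand it back to the full symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three observations with two numeric features.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 1.0]])

# Condensed distances for pairs (0,1), (0,2), (1,2), in that order.
d = pdist(X)          # [5.0, 1.0, sqrt(18)]

# Expand to the symmetric 3x3 matrix if needed.
full = squareform(d)
```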

However, this function cannot deal with categorical variables. For those, I want the per-variable distance to be 0 when the values are equal and 1 otherwise.

I have tried to implement this variant in Python with Numba. The function takes as input the 2D Numpy array containing all the observations and a 1D array containing the types of the variables (either float64 or category).

Here is the code:

import numpy as np
from numba.decorators import autojit

def pairwise(X, types):
    m = X.shape[0]
    n = X.shape[1]

    D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float)
    ind = 0

    for i in range(m):
        for j in range(i+1, m):
            d = 0.0

            for k in range(n):
                if types[k] == 'float64':
                    tmp = X[i, k] - X[j, k]
                    d += tmp * tmp
                else:
                    if X[i, k] != X[j, k]:
                        d += 1.

            D[ind] = np.sqrt(d)
            ind += 1

    return D.reshape(1, -1)[0]

pairwise_numba = autojit(pairwise)

vectors = np.random.rand(20000, 100)
types = np.array(['float64']*100)

dists = pairwise_numba(vectors, types)

This implementation is very slow despite the use of Numba. Is it possible to improve my code to make it faster?


Solution

  • If you really want numba to perform fast, you need to jit the function in nopython mode; otherwise numba may fall back to object mode, which is slower (and can be quite slow).

    However, your function cannot be compiled in nopython mode (as of numba version 0.43.1), because:

    • the dtype argument to np.empty: np.float is simply Python's float, which NumPy (but not numba) translates to np.float_. With numba you have to spell the NumPy dtype out explicitly.
    • string support in numba is limited, so the types[k] == 'float64' comparison will not compile.

    The first issue is trivially fixed. Regarding the second issue: instead of trying to make the string comparisons work, just provide a boolean array. Evaluating one boolean for truthiness is also significantly faster than comparing up to 7 characters, especially in the innermost loop!
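    For illustration (a small sketch with made-up column types): the mask is built once with a vectorized NumPy comparison, outside the jitted function:

    ```python
    import numpy as np

    # Hypothetical column types for a dataset with 4 variables.
    types = np.array(['float64', 'category', 'float64', 'category'])

    # One vectorized comparison yields the mask; the innermost loop
    # then only has to check a single boolean per column.
    is_float_type = types == 'float64'
    # -> array([ True, False,  True, False])
    ```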

    So it might look like this:

    import numpy as np
    import numba as nb
    
    @nb.njit
    def pairwise_numba(X, is_float_type):
        m = X.shape[0]
        n = X.shape[1]
    
        D = np.empty((int(m * (m - 1) / 2), 1), dtype=np.float64)  # corrected dtype
        ind = 0
    
        for i in range(m):
            for j in range(i+1, m):
                d = 0.0
    
                for k in range(n):
                    if is_float_type[k]:
                        tmp = X[i, k] - X[j, k]
                        d += tmp * tmp
                    else:
                        if X[i, k] != X[j, k]:
                            d += 1.
    
                D[ind] = np.sqrt(d)
                ind += 1
    
        return D.reshape(1, -1)[0]
    
    dists = pairwise_numba(vectors, types == 'float64')  # pass in the boolean array
    

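    As a quick sanity check (a sketch without numba, not part of the answer's code): with only float columns the mixed metric reduces to plain Euclidean distance, so a direct loop in the same pair order must reproduce scipy's pdist output:

    ```python
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.random((50, 4))  # all-float data

    # Reference: scipy's condensed Euclidean distances.
    expected = pdist(X)

    # Same quantity computed directly, mirroring the loop order of
    # pairwise_numba: pairs (i, j) with i < j.
    m = X.shape[0]
    got = np.array([np.sqrt(np.sum((X[i] - X[j]) ** 2))
                    for i in range(m) for j in range(i + 1, m)])

    assert np.allclose(got, expected)
    ```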
    However, you can simplify the logic if you combine scipy.spatial.distance.pdist on the float columns with a small numba function that counts the unequal categoricals:

    from scipy.spatial.distance import pdist
    
    @nb.njit
    def categorial_sum(X):
        m = X.shape[0]
        n = X.shape[1]
        D = np.zeros(int(m * (m - 1) / 2), dtype=np.float64)  # corrected dtype
        ind = 0
    
        for i in range(m):
            for j in range(i+1, m):
                d = 0.0
                for k in range(n):
                    if X[i, k] != X[j, k]:
                        d += 1.
                D[ind] = d
                ind += 1
    
        return D
    
    def pdist_with_categorial(vectors, types):
        where_float_type = types == 'float64'
        # calculate the squared distance of the float values
        distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
        # sum the number of mismatched categoricals, add that to the
        # squared distances, and then take the square root
        return np.sqrt(distances_squared + categorial_sum(vectors[:, ~where_float_type]))
    

    It won't be significantly faster, but it drastically simplifies the logic in the numba function.

    Then you can also avoid the additional array creation by passing the squared distances into the numba function and taking the square root in place:

    @nb.njit
    def add_categorial_sum_and_sqrt(X, D):
        m = X.shape[0]
        n = X.shape[1]
        ind = 0
        for i in range(m):
            for j in range(i+1, m):
                d = 0.0
                for k in range(n):
                    if X[i, k] != X[j, k]:
                        d += 1.
                D[ind] = np.sqrt(D[ind] + d)
                ind += 1
    
        return D
    
    def pdist_with_categorial(vectors, types):
        where_float_type = types == 'float64'
        distances_squared = pdist(vectors[:, where_float_type], metric='sqeuclidean')
        return add_categorial_sum_and_sqrt(vectors[:, ~where_float_type], distances_squared)
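    As an aside (a sketch, not the answer's code): for moderate sizes the categorical mismatch count can itself come from pdist's built-in 'hamming' metric, which returns the *fraction* of mismatching columns; multiplying by the number of categorical columns recovers the count, so the whole mixed metric can be cross-checked without numba:

    ```python
    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(1)
    m = 30
    floats = rng.random((m, 3))                           # numeric columns
    cats = rng.integers(0, 3, size=(m, 2)).astype(float)  # categorical codes

    # Combined metric: squared Euclidean on the float columns plus the
    # mismatch count on the categorical columns ('hamming' gives the
    # fraction, so multiply by the number of categorical columns).
    combined = np.sqrt(pdist(floats, metric='sqeuclidean')
                       + pdist(cats, metric='hamming') * cats.shape[1])

    # Direct computation of the mixed metric for reference.
    direct = np.array([
        np.sqrt(np.sum((floats[i] - floats[j]) ** 2)
                + np.sum(cats[i] != cats[j]))
        for i in range(m) for j in range(i + 1, m)
    ])

    assert np.allclose(combined, direct)
    ```

    This pure-scipy variant allocates a few temporaries, so for very large datasets the numba version above that fills D in place remains the better fit.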