Tags: python, scikit-learn, scipy, distance

Scikit-learn and Scipy yield diverging cosine distances -- after removing feature columns


Given an N x M matrix with N observations and M features, a common task is to compute the pairwise distances between the N observations, resulting in an N x N distance matrix. The popular Python libraries scipy and scikit-learn both provide methods for this task, and we would expect them to yield the same results for metrics that both have implemented. The following function tests the equivalence for a given matrix arr:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import pdist, squareform

def test_equivalence(arr: np.ndarray, metric: str = "cosine") -> bool:
    scipy_result = squareform(pdist(arr, metric=metric))
    sklearn_result = pairwise_distances(arr, metric=metric)
    return bool(np.isclose(scipy_result, sklearn_result).all())

Now I happen to have a 1219 x 37652 array arr in which each row sums to 1 (normalized), and test_equivalence(arr) yields True, as expected. That is, the N x N cosine-distance matrices returned by the two libraries can be used interchangeably. However, when I cull the last i columns, test_equivalence(arr[:, :-i]) yields True only up to a certain value of i (which happens to be i = 25676); from this value onwards, the equivalence no longer holds.
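If the equivalence flips exactly once as i grows, the crossover can be located by bisection rather than by scanning every value of i. A sketch (assuming arr is loaded and that the outcome really is monotone in i):

# Bisect over i: find the smallest number of culled columns that breaks equivalence.
lo, hi = 1, arr.shape[1] - 1
while lo < hi:
    mid = (lo + hi) // 2
    if test_equivalence(arr[:, :-mid]):
        lo = mid + 1  # still equivalent: cull more columns
    else:
        hi = mid      # already failing: try culling fewer
print("first failing i:", lo)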

I am completely out of ideas as to why this happens; any guidance? I can share the array as an .npz file for debugging if someone can advise how, but maybe someone already has a hunch. The ultimate question, of course, is which implementation I should be using.

I've also tested the failing arr[:, :-25675] with these other metrics: ["braycurtis", "canberra", "chebyshev", "cityblock", "correlation", "euclidean", "hamming", "matching", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"], and all of them except "correlation" were equivalent.
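For reference, that sweep can be reproduced in one loop (a sketch reusing test_equivalence() from above; assumes the full arr is loaded):

# Expect False only for "correlation" (and "cosine"); all others should print True.
metrics = ["braycurtis", "canberra", "chebyshev", "cityblock", "correlation",
           "euclidean", "hamming", "matching", "minkowski", "rogerstanimoto",
           "russellrao", "seuclidean", "sokalmichener", "sokalsneath",
           "sqeuclidean", "yule"]
failing = arr[:, :-25675]
for m in metrics:
    print(m, test_equivalence(failing, metric=m))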

Edit: A reduced (1219 x 96) array that fails the equivalence test can be downloaded from https://drive.switch.ch/index.php/s/B19JbTL5aZ4pY3f/download and loaded via np.load("tf_matrix.npz")["arr_0"].


Solution

  • There are a few diagnostics you can try in this situation. One is to plot the places where the two methods of computing distance disagree.

    # Modified version of test_equivalence() that returns the boolean
    # agreement matrix instead of a single flag
    import matplotlib.pyplot as plt

    def test_equivalence(arr: np.ndarray, metric="cosine"):
        scipy_result = squareform(pdist(arr, metric=metric))
        sklearn_result = pairwise_distances(arr, metric=metric)
        return np.isclose(scipy_result, sklearn_result)

    plt.imshow(test_equivalence(arr))
    plt.show()
    

    This gives the following plot:

    equivalence plot

    Note that it is True everywhere except for one horizontal line and one vertical line, both around index 1200. So it's not that the two methods disagree about all vectors: they disagree in every comparison involving one particular vector.

    Let's find the index of the vector that they disagree over:

    >>> row, col = np.where(~test_equivalence(arr))
    >>> print(col[0])
    1168
    

    Is there anything strange about vector 1168?

    >>> print(arr[1168])
    [1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23 1.20000016e-23 1.20000016e-23
     1.20000016e-23 1.20000016e-23]
    

    This vector is very, very small. But is it unusually small compared to the other vectors? You can test this by plotting the Euclidean length of each vector against its position in the array.

    plt.scatter(np.arange(len(arr)), np.linalg.norm(arr, axis=1))
    plt.yscale("log")  # log scale so the tiny outlier is visible
    plt.show()
    

    This plot shows that most vectors have a euclidean length of about 0.1, except one vector, which is 20 orders of magnitude smaller. It's vector 1168 again.

    log plot of vector norms
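    You can also confirm this numerically rather than reading it off the plot:

    norms = np.linalg.norm(arr, axis=1)
    print(norms.argmin(), norms.min())  # index of the smallest-norm row; 1168 again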

    To check the theory that tiny vectors cause the problem, here is an alternate way of demonstrating it. I took your array and repeatedly simplified it until I had a test case that was as simple as possible while still showing the issue.

    arr_small = np.array([[1, 0], [1e-15, 1e-15]])
    print(test_equivalence(arr_small))
    print(squareform(pdist(arr_small, metric="cosine")))
    print(pairwise_distances(arr_small, metric="cosine"))
    

    Output:

    [[ True False]
     [False  True]]
    [[0.         0.29289322]
     [0.29289322 0.        ]]
    [[0. 1.]
     [1. 0.]]
    

    I declare two vectors, one with coordinates (1, 0) and the other with coordinates (1e-15, 1e-15). These have an angle of 45 degrees between them, so the cosine distance should be 1 - cos(45 degrees) ≈ 0.2929. The pdist() function agrees with this calculation.
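    You can verify that figure with a quick computation (a sanity check, not library code):

    u, v = np.array([1.0, 0.0]), np.array([1e-15, 1e-15])
    cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    print(1 - cos_sim)                 # 0.2928932..., the value pdist() returns
    print(1 - np.cos(np.deg2rad(45)))  # same value, computed from the 45-degree angle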

    However, pairwise_distances() says the distance is 1. In other words, it claims the two vectors are orthogonal. To understand why, let's look at the definition of cosine distance.

    cosine_distance(u, v) = 1 - (u · v) / (‖u‖₂ ‖v‖₂)

    (Formula as given in the SciPy documentation.)

    In this equation, if either u or v contains all zeros, the denominator is zero and the division is undefined. To avoid this, whenever the Euclidean length of a vector is "too small," pairwise_distances() replaces that length with 1. The numerator is then much smaller than the denominator, so the fraction is effectively 0 and the distance comes out as 1.

    More precisely, a vector is "too small" when its length is below 10 times the machine epsilon for the relevant dtype, which works out to approximately 2.22e-15 for 64-bit floats. (Source.)
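    To make this concrete, here is a minimal sketch of that clamping logic (an illustration of the behavior described above, not scikit-learn's actual code):

    def clamped_cosine_distance(a, b):
        # Replace "too small" norms with 1, as described above, before dividing.
        eps = np.finfo(np.float64).eps
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        na = 1.0 if na < 10 * eps else na
        nb = 1.0 if nb < 10 * eps else nb
        return 1 - (a @ b) / (na * nb)

    print(clamped_cosine_distance(np.array([1.0, 0.0]), np.array([1e-15, 1e-15])))
    # ~1.0, matching pairwise_distances()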

    In contrast, pdist() does not contain any code to avoid this division by zero, so an all-zero vector simply yields nan:

    >>> print(squareform(pdist(np.array([[1, 0], [0, 0]]), metric="cosine")))
    [[ 0. nan]
     [nan  0.]]