Search code examples
pythoneuclidean-distancescipy-spatial

Why cdist from scipy.spatial.distance is so fast?


I wanted to create a distance proximity matrix for 10060 records/ points, where each record/point has 23 attributes using euclidean distance as metric. I wrote code using nested for loops to calculate distance between each point(leading to (n(n-1))/2) computations). It took a long time(about 8 minutes). When I used cdist it took so much lesser time(just 3 seconds !!!). When I looked at the source code, the cdist also uses nested for loops and moreover it makes n^2 computations(which is greater than the number of comparisons my logic does). What is making cdist execute faster and give correct output as well ? Please help me understand. Thanks in advance.


Solution

  • Where did you read the source code? The python code calls (if you follow it all the way down in the default metric='euclidean' case) the c code

    static NPY_INLINE int
    cdist_seuclidean(const double *XA, const double *XB, const double *var,
                     double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                     const npy_intp num_cols)
    {
        npy_intp i, j;
    
        for (i = 0; i < num_rowsA; ++i) {
            const double *u = XA + (num_cols * i);
            for (j = 0; j < num_rowsB; ++j, ++dm) {
                const double *v = XB + (num_cols * j);
                *dm = seuclidean_distance(var, u, v, num_cols);
            }
        }
        return 0;
    }
    

    where seuclidean_distance is

    static NPY_INLINE double
    seuclidean_distance(const double *var, const double *u, const double *v,
                        const npy_intp n)
    {
        double s = 0.0;
        npy_intp i;
    
        for (i = 0; i < n; ++i) {
            const double d = u[i] - v[i];
            s += (d * d) / var[i];
        }
        return sqrt(s);
    }
    

    So it's actually a triple loop, but this is highly optimised C code. Python for loops are slow, they take up a lot of overhead and should never be used with numpy arrays because scipy/numpy can take advantage of the underlying memory data held within an ndarray object in ways that python can't.