I want to compute the squared Euclidean distance in a fast way, as described here:
What is the fastest way to compute an RBF kernel in python?
Note 1: I am only interested in the distance, not the RBF kernel.
Note 2: I am neglecting numexpr here and only using numpy directly.
In short, I compute:
|| x - y ||^2 = ||x||^2 + ||y||^2 - 2. * (x @ y.T)
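For concreteness, a quick sanity check of this identity on two single vectors looks like:

import numpy as np

x = np.random.rand(10)
y = np.random.rand(10)

lhs = np.sum((x - y)**2)            # direct squared distance
rhs = x @ x + y @ y - 2. * (x @ y)  # expanded form
print(np.isclose(lhs, rhs))         # True (up to rounding)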
With this I am able to compute the distance matrix faster than scipy.pdist by a factor of ~10. However, I observe numerical issues, which get worse if I take the square root to get the Euclidean distance: I get values on the order of 1e-8 to 1e-7 that should be exactly zero (i.e. duplicated points, or the distance of a point to itself).
Are there ways or ideas to overcome these numerical issues (preferably without sacrificing too much of the evaluation speed)? Or are the numerical issues the reason why this path is not taken (e.g. by scipy.pdist) in the first place?
This is a small code example to show the numerical issues (not the speed-ups; for those, please look at the answers of the linked SO thread above).
import numpy as np

M = np.random.rand(1000, 10)
M_norm = np.sum(M**2, axis=1)  # squared row norms ||x||^2

# squared distances via ||x||^2 + ||y||^2 - 2 * x @ y.T
res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T

unique = np.unique(np.diag(res))  # analytically all diag values are exactly zero
sqrt_unique = np.sqrt(unique)

print(unique)
print(sqrt_unique)
[-2.66453526e-15 -1.77635684e-15 -8.88178420e-16 -4.44089210e-16
0.00000000e+00 4.44089210e-16 8.88178420e-16 1.77635684e-15
3.55271368e-15]
[ nan nan nan nan
0.00000000e+00 2.10734243e-08 2.98023224e-08 4.21468485e-08
5.96046448e-08]
As you can see, some values are also negative (which results in nan after taking the sqrt). Of course these are easy to catch, but the small positive values carry a large error once the square root is taken for the Euclidean case (e.g. abs_error=5.96046448e-08).
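If I understand the floating point model correctly, these residues are about the size one would expect: the cancellation leaves leftovers on the order of machine epsilon (~2.2e-16 for float64), and taking the square root amplifies them to roughly sqrt(eps) ~ 1.5e-8:

import numpy as np

eps = np.finfo(np.float64).eps
print(eps)           # ~2.22e-16
print(np.sqrt(eps))  # ~1.49e-08, the rough noise floor after sqrt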
As per my comment, using abs is probably your best option for cleaning up the numerical noise inherent in this algorithm. Since you're concerned about performance, you should probably be using the in-place (mutating) assignment operators, as they create less garbage and hence can be much faster. Also, when running this with many features (e.g. 10k) I see pdist being slower than this implementation.
Putting the above together we get:
import numpy as np

def edist0(M):
    "calculate pairwise euclidean distance"
    M_norm = np.sum(M**2, axis=1)
    res = M_norm[:, np.newaxis] + M_norm[np.newaxis, :] - 2. * M @ M.T
    return np.sqrt(np.abs(res))

def edist1(M):
    "optimised calculation of pairwise euclidean distance"
    M_norm = np.einsum('ij,ij->i', M, M)  # squared row norms, no temporary array
    res = M @ M.T
    res *= -2.                            # in-place ops avoid extra allocations
    res += M_norm[:, np.newaxis]
    res += M_norm[np.newaxis, :]
    return np.sqrt(np.abs(res, out=res), out=res)
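As a quick sanity check (assuming scipy is available, and edist1 as defined above), the result matches pdist up to the ~1e-7 noise floor discussed in the question:

import numpy as np
from scipy.spatial import distance

M = np.random.rand(100, 10)
ref = distance.squareform(distance.pdist(M))   # reference euclidean distances
print(np.allclose(edist1(M), ref, atol=1e-6))  # True, within the noise floor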
Timing this in IPython with:
from scipy.spatial import distance
M = np.random.rand(1000, 10000)
%timeit distance.squareform(distance.pdist(M))
%timeit edist0(M)
%timeit edist1(M)
I get:
2.82 s ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    # pdist
296 ms ± 6.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    # edist0
153 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # edist1

and no errors/warnings from sqrt.
The linked question also points to scikit-learn as having good distance/kernel implementations, the Euclidean one being pairwise_distances, which benchmarks as:
from sklearn.metrics import pairwise_distances
%timeit pairwise_distances(M)
170 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which might be nice to use if you're already using that package.
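If you go that route, scikit-learn also exposes euclidean_distances directly, which (as far as I can tell) uses the same dot-product expansion internally and clips small negative residues to zero before the sqrt:

from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

M = np.random.rand(1000, 10)
D = euclidean_distances(M)  # pairwise euclidean distances of the rows of M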