I have a matrix of vectors where each row is a vector. I want to take the mean of all the vectors, then calculate the cosine distance between each vector and this mean, returning an array of distances.
>>> from numpy import arange
>>> x = arange(1,10).reshape(3,3)
>>> x
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> m = x.mean(0)
>>> m
array([4., 5., 6.])
The cosine distances are as follows:
>>> from scipy.spatial.distance import cosine
>>> cosine([1,2,3], [4,5,6])
0.0253681538029239
>>> cosine([4,5,6], [4,5,6])
0.0
>>> cosine([7,8,9], [4,5,6])
0.001809107314273195
Therefore, I want to write a function f such that:
>>> f(x, m)
array([0.0253681538029239, 0.0, 0.001809107314273195])
(Or the transpose of such an array. It doesn't matter.)
What is the most efficient, most numpythonic way to write f? It seems like the trick is to get the proper broadcast over the cosine function, but I haven't figured out how to do this. The following doesn't work:
>>> from numpy import frompyfunc
>>> f = frompyfunc(cosine, 2, 1)
>>> f(x, m)
array([[0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]], dtype=object)
(It looks like numpy is applying cosine element-wise here instead of row-wise.)
Is there a way to do this without writing a for loop?
It looks like this is possible with apply_along_axis.
>>> from numpy import apply_along_axis
>>> from functools import partial
>>> g = partial(cosine, m)
>>> apply_along_axis(g, 1, x)
array([0.02536815, 0.        , 0.00180911])
Is this the most efficient way?
You can use cdist for this. It expects both inputs to be 2D arrays, so you need to reshape your mean array to be 2D.
>>> from scipy.spatial.distance import cdist
>>> cdist(x, m.reshape(1, -1), metric='cosine')
array([[2.53681538e-02],
       [2.22044605e-16],
       [1.80910731e-03]])
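If you would rather not depend on scipy, the same distances can be computed with plain numpy from the definition of cosine distance, 1 - (u . v) / (|u| |v|). The following is only a minimal sketch: the function name cosine_to_vector is made up for illustration, and it assumes that no row of x (and not m itself) is the zero vector, since that would divide by zero.

```python
import numpy as np

def cosine_to_vector(x, m):
    """Cosine distance from each row of x to the single vector m.

    Hypothetical helper, not part of numpy or scipy.
    Assumes neither m nor any row of x has zero norm.
    """
    x = np.asarray(x, dtype=float)
    m = np.asarray(m, dtype=float)
    # One dot product per row: shape (n_rows,)
    dots = x @ m
    # Product of each row's norm with the norm of m
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(m)
    return 1.0 - dots / norms

x = np.arange(1, 10).reshape(3, 3)
m = x.mean(axis=0)
d = cosine_to_vector(x, m)
print(d)
```

This returns a flat array of shape (3,) rather than the (3, 1) column that cdist gives. As with cdist, the entry for the row equal to m may come out as a tiny nonzero value instead of exactly 0.0 because of floating-point rounding.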