I got this code where distance is a lower triangular matrix defined as this:
distance = np.tril(scipy.spatial.distance.cdist(points, points))
def make_them_touch(distance):
"""
Return the every distance where two points touched each other. See example below.
"""
thresholds = np.unique(distance)[1:] # to avoid 0 at the beginning, not taking a lot of time at all
result = dict()
for t in thresholds:
x, y = np.where(distance == t)
result[t] = [i for i in zip(x,y)]
return result
My problem is that np.where is quite slow with big matrix (2000*100 for example).
How can I speed up this code, by improving np.where or by changing the algo ?
EDIT : as MaxU pointed out, the best optimization here is NOT to generate a squared matrix and to use iterators.
points = np.array([
...: [0,0,0,0],
...: [1,1,1,1],
...: [3,3,3,3],
...: [6,6,6,6]
...: ])
In [106]: distance = np.tril(scipy.spatial.distance.cdist(points, points))
In [107]: distance
Out[107]:
array([[ 0., 0., 0., 0.],
[ 2., 0., 0., 0.],
[ 6., 4., 0., 0.],
[12., 10., 6., 0.]])
In [108]: make_them_touch(distance)
Out[108]:
{2.0: [(1, 0)],
4.0: [(2, 1)],
6.0: [(2, 0), (3, 2)],
10.0: [(3, 1)],
12.0: [(3, 0)]}
UPDATE1: here is a snippet for an upper triangular distance matrix (it shouldn't really matter as the distance matrix is always symmetric):
from itertools import combinations
res = {tup[0]:tup[1] for tup in zip(pdist(points), list(combinations(range(len(points)), 2)))}
result:
In [111]: res
Out[111]:
{1.4142135623730951: (0, 1),
4.69041575982343: (0, 2),
4.898979485566356: (1, 2)}
UPDATE2: this version will support duplicates in distances:
In [164]: import pandas as pd
first we construct a Pandas.Series:
In [165]: s = pd.Series(list(combinations(range(len(points)), 2)), index=pdist(points))
In [166]: s
Out[166]:
2.0 (0, 1)
6.0 (0, 2)
12.0 (0, 3)
4.0 (1, 2)
10.0 (1, 3)
6.0 (2, 3)
dtype: object
now we can group by index and produce lists of coordinates:
In [167]: s.groupby(s.index).apply(list)
Out[167]:
2.0 [(0, 1)]
4.0 [(1, 2)]
6.0 [(0, 2), (2, 3)]
10.0 [(1, 3)]
12.0 [(0, 3)]
dtype: object
PS the main idea here is that you shouldn't build square distance matrix if you are going to flatten it afterwards and to get rid of duplicates.