Tags: python, numpy, scipy, python-itertools

How to make np.where more efficient with triangular matrices?


I have this code, where distance is a lower triangular matrix defined like this:

import numpy as np
import scipy.spatial

distance = np.tril(scipy.spatial.distance.cdist(points, points))

def make_them_touch(distance):
    """
    Return every distance at which two points touch each other. See example below.
    """
    thresholds = np.unique(distance)[1:]  # skip the leading 0; this step is fast
    result = dict()
    for t in thresholds:
        x, y = np.where(distance == t)
        result[t] = list(zip(x, y))
    return result

My problem is that np.where is quite slow with a big matrix (2000*100, for example).
How can I speed up this code, either by improving the np.where step or by changing the algorithm?
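
For reference, one way to speed things up while still keeping the square matrix is to extract the strictly lower triangle once and group equal distances in a single pass, instead of rescanning the whole matrix with np.where for every threshold. A minimal sketch (make_them_touch_fast is a hypothetical name, not from the original post):

import numpy as np

def make_them_touch_fast(distance):
    # Indices of the strictly lower triangle (k=-1 excludes the diagonal).
    rows, cols = np.tril_indices_from(distance, k=-1)
    values = distance[rows, cols]
    # np.unique returns the sorted unique values plus, for each element,
    # the index of its unique value, so one pass groups all equal distances.
    uniq, inverse = np.unique(values, return_inverse=True)
    result = {u: [] for u in uniq}
    for r, c, i in zip(rows, cols, inverse):
        result[uniq[i]].append((r, c))
    return result

Unlike the original, this sketch keeps a genuine 0.0 distance between coincident points instead of discarding it along with the diagonal zeros.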

EDIT: as MaxU pointed out, the best optimization here is NOT to generate a square matrix at all, but to use iterators instead (see the Solution below).

Example:

points = np.array([
    [0,0,0,0],
    [1,1,1,1],
    [3,3,3,3],
    [6,6,6,6]
])

In [106]: distance = np.tril(scipy.spatial.distance.cdist(points, points))

In [107]: distance
Out[107]: 
array([[ 0.,  0.,  0.,  0.],
       [ 2.,  0.,  0.,  0.],
       [ 6.,  4.,  0.,  0.],
       [12., 10.,  6.,  0.]])

In [108]: make_them_touch(distance)
Out[108]: 
{2.0: [(1, 0)],
 4.0: [(2, 1)],
 6.0: [(2, 0), (3, 2)],
 10.0: [(3, 1)],
 12.0: [(3, 0)]}

Solution

  • UPDATE1: here is a snippet that works with the condensed (upper-triangular) distance matrix produced by pdist (it shouldn't really matter, as a distance matrix is always symmetric):

    from itertools import combinations
    from scipy.spatial.distance import pdist

    res = {d: pair for d, pair in zip(pdist(points), combinations(range(len(points)), 2))}
    

    result (for the example points above):

    In [111]: res
    Out[111]: {2.0: (0, 1), 4.0: (1, 2), 6.0: (2, 3), 10.0: (1, 3), 12.0: (0, 3)}

    Note that the duplicated distance 6.0 kept only the last pair, (2, 3); the (0, 2) pair was silently overwritten, which is exactly what UPDATE2 below fixes.
    

    UPDATE2: this version supports duplicate distances:

    In [164]: import pandas as pd
    

    first we construct a pandas.Series:

    In [165]: s = pd.Series(list(combinations(range(len(points)), 2)), index=pdist(points))
    
    In [166]: s
    Out[166]:
    2.0     (0, 1)
    6.0     (0, 2)
    12.0    (0, 3)
    4.0     (1, 2)
    10.0    (1, 3)
    6.0     (2, 3)
    dtype: object
    

    now we can group by index and produce lists of coordinates:

    In [167]: s.groupby(s.index).apply(list)
    Out[167]:
    2.0             [(0, 1)]
    4.0             [(1, 2)]
    6.0     [(0, 2), (2, 3)]
    10.0            [(1, 3)]
    12.0            [(0, 3)]
    dtype: object
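
    If you'd rather avoid the pandas dependency, the same grouping can be done with a plain defaultdict; a minimal sketch (make_them_touch_condensed is a hypothetical name, not part of the original answer):

    from collections import defaultdict
    from itertools import combinations
    from scipy.spatial.distance import pdist

    def make_them_touch_condensed(points):
        # Walk the condensed distance matrix from pdist exactly once,
        # grouping point pairs by their pairwise distance.
        result = defaultdict(list)
        for d, pair in zip(pdist(points), combinations(range(len(points)), 2)):
            result[d].append(pair)
        return dict(result)

    For the example points above this should return the same mapping as make_them_touch, without ever materializing the square matrix.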
    

    PS the main idea here is that you shouldn't build a square distance matrix at all if you are going to flatten it and de-duplicate it afterwards anyway.
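
    To see why this matters at the scale mentioned in the question, compare how many entries each approach materializes; a quick sketch (the random points array is made up purely for illustration):

    import numpy as np
    from scipy.spatial.distance import cdist, pdist

    rng = np.random.default_rng(0)
    points = rng.random((2000, 100))    # made-up data at the question's scale

    full = cdist(points, points)        # square: 2000 * 2000 = 4,000,000 entries
    condensed = pdist(points)           # one per pair: 2000 * 1999 / 2 = 1,999,000

    print(full.shape, condensed.shape)  # (2000, 2000) (1999000,)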