python arrays python-3.x numpy performance

Fastest way to merge columns of the same type in Numpy

Say I have a numpy array A that records the distances from different 1D points. The first column records the coordinate of each destination point, the second column records the pairwise distances. Now I want to construct a distance matrix D that records the pairwise distances (rows correspond to source points, columns correspond to destination points, the distance will be recorded as infinite if unknown). This should be easy, but becomes tricky when there are duplicate destination points. Below is a simple example: I have two points with the coordinate 2.5, and two points with the coordinate 6.1. I tried using np.unique to get the unique columns, but when there are duplicates, it picks the first-appeared distance and lose the distance information of the other pairs

import numpy as np

A = np.array([[7.3, 0.25],
              [2.5, 0.32],
              [3.7, 0.45],
              [6.1, 0.55],
              [2.5, 0.91],
              [4.8, 0.77],
              [8.6, 0.35],
              [6.1, 0.82]])
D = np.ones((A.shape[0],A.shape[0])) * np.inf
np.fill_diagonal(D, A[:,1])
unq_ids = np.sort(np.unique(A[:,0], return_index=True)[1])
D = D[:,unq_ids]
print(D)

This gives

array([[0.25,  inf,  inf,  inf,  inf,  inf],
       [ inf, 0.32,  inf,  inf,  inf,  inf],
       [ inf,  inf, 0.45,  inf,  inf,  inf],
       [ inf,  inf,  inf, 0.55,  inf,  inf],
       [ inf,  inf,  inf,  inf,  inf,  inf],
       [ inf,  inf,  inf,  inf, 0.77,  inf],
       [ inf,  inf,  inf,  inf,  inf, 0.35],
       [ inf,  inf,  inf,  inf,  inf,  inf]])

As you can see, the row 4 and 7 doesn't have any distance information, but I am expecting

array([[0.25,  inf,  inf,  inf,  inf,  inf],
       [ inf, 0.32,  inf,  inf,  inf,  inf],
       [ inf,  inf, 0.45,  inf,  inf,  inf],
       [ inf,  inf,  inf, 0.55,  inf,  inf],
       [ inf, 0.91,  inf,  inf,  inf,  inf],
       [ inf,  inf,  inf,  inf, 0.77,  inf],
       [ inf,  inf,  inf,  inf,  inf, 0.35],
       [ inf,  inf,  inf, 0.82,  inf,  inf]])

In fact, I have a more than 10000 pairs of points, so I want to know a fast way in Numpy to first detect duplicates of the destination points, and then "merge" the distance matrix's columns that correspond to the same destination point (please don't use Pandas).

Solution

In [104]: unq_ids = np.sort(np.unique(A[:,0], return_index=True)[1])

In [105]: unq_ids
Out[105]: array([0, 1, 2, 3, 5, 6])

With those numbers, just the row/column indices, D[:,unq_ids] is just going to return D as you filled with the diagonal.

In [110]: np.unique(A[:,0], return_index=True, return_inverse=True, return_counts=True)
Out[110]: 
(array([2.5, 3.7, 4.8, 6.1, 7.3, 8.6]),
 array([1, 2, 5, 3, 0, 6]),
 array([4, 0, 1, 3, 0, 2, 5, 3]),
 array([2, 1, 1, 2, 1, 1]))

counts tells us how many duplicates there are, but unique does not identify the indices of those duplicates.

In general, numpy isn't that good at 'matching' things. We can do matching in numpy, but it takes some work. Pandas has a convenient group_by, though I've never tested it for speed.

In [112]: np.sort(A[:,0])
Out[112]: array([2.5, 2.5, 3.7, 4.8, 6.1, 6.1, 7.3, 8.6])
In [113]: np.argsort(A[:,0])
Out[113]: array([1, 4, 2, 5, 3, 7, 0, 6])

We could compare that with the return_index, array([1, 2, 5, 3, 0, 6]), to find that 4 and 7 are the duplicates.

In [115]: D[:,[4,7]]
Out[115]: 
array([[ inf,  inf],
       [ inf,  inf],
       [ inf,  inf],
       [ inf,  inf],
       [0.91,  inf],
       [ inf,  inf],
       [ inf,  inf],
       [ inf, 0.82]])

In [117]: A[[4,7]]
Out[117]: 
array([[2.5 , 0.91],
       [6.1 , 0.82]])

So with the unique A[:,0] values, we can find all 'groups'

In [120]: U = np.array([2.5, 3.7, 4.8, 6.1, 7.3, 8.6])
In [123]: for i in U:
     ...:     x = np.nonzero(A[:,0]==i)
     ...:     print(x,A[x,1])
     ...:     
(array([1, 4]),) [[0.32 0.91]]
(array([2]),) [[0.45]]
(array([5]),) [[0.77]]
(array([3, 7]),) [[0.55 0.82]]
(array([0]),) [[0.25]]
(array([6]),) [[0.35]]

So now all we need to do is map those onto your D (with sorted unique 'indices'):