Search code examples
pythonpython-2.7python-3.xnumpyarray-broadcasting

Calculate Distances Between One Point in Matrix From All Other Points


I am new to Python and I need to implement a clustering algorithm. For that, I will need to calculate distances between the given input data.

Consider the following input data -

    [[1,2,8],
     [7,4,2],
     [9,1,7],
     [0,1,5],
     [6,4,3]]

What I am looking to achieve here is, I want to calculate distance of [1,2,8] from ALL other points, and find a point where the distance is minimum.

And I have to repeat this for ALL other points.

I am trying to implement this with a FOR loop, but I am sure that SciPy/ NumPy must be having a function which can help me achieve this result efficiently.

I looked online, but the 'pdist' command could not get my work done.

Can someone guide me?

TIA


Solution

  • Use np.linalg.norm combined with broadcasting (numpy outer subtraction), you can do:

    np.linalg.norm(a - a[:,None], axis=-1)
    

    a[:,None] insert a new axis into a, a - a[:,None] will then do a row by row subtraction due to broadcasting. np.linalg.norm calculates the np.sqrt(np.sum(np.square(...))) over the last axis:


    a = np.array([[1,2,8],
         [7,4,2],
         [9,1,7],
         [0,1,5],
         [6,4,3]])
    
    np.linalg.norm(a - a[:,None], axis=-1)
    #array([[ 0.        ,  8.71779789,  8.1240384 ,  3.31662479,  7.34846923],
    #       [ 8.71779789,  0.        ,  6.164414  ,  8.18535277,  1.41421356],
    #       [ 8.1240384 ,  6.164414  ,  0.        ,  9.21954446,  5.83095189],
    #       [ 3.31662479,  8.18535277,  9.21954446,  0.        ,  7.        ],
    #       [ 7.34846923,  1.41421356,  5.83095189,  7.        ,  0.        ]])
    

    The elements [0,1], [0,2] for instance correspond to:

    np.sqrt(np.sum((a[0] - a[1]) ** 2))
    # 8.717797887081348
    
    np.sqrt(np.sum((a[0] - a[2]) ** 2))
    # 8.1240384046359608
    

    respectively.