python numpy scipy cluster-analysis euclidean-distance

Cluster Analysis: Problem finding Euclidean distances of centroids in a dataframe from origin


Each row of df_centroids is a centroid, and its 7 columns are that centroid's coordinates in a 7-dimensional space.

import numpy as np 
import pandas as pd 
import scipy
df_centroids

        0           1           2           3           4            5          6
0   2.443664    -0.158806   -0.403137   0.609063    -0.412371   -0.486611   -0.687598
1   -0.389052   1.258986    -0.517471   -0.127748   0.379712    -0.486611   -0.143564
2   -0.215555   0.201088    1.149816    -0.501471   0.275600    -0.088475   1.434132
3   -0.227075   -0.806379   -0.412111   -0.174150   -0.417327   -0.401676   -0.234962
4   -0.130615   0.197548    1.282325    -0.940454   0.161774    2.167632    -0.263252
5   0.015202    -0.125552   -0.665733   1.792274    -0.360096   -0.390093   -0.044649

I'm trying to calculate each centroid's Euclidean distance from the origin and save it in a new 'Euclidean Distance' column. Please see the code below:

df_centroids['Euclidean Distance']=''
from scipy.spatial import distance

i=0
while i<len(df_centroids.index):
    centroid=[df_centroids.iloc[i,0], df_centroids.iloc[i,1], df_centroids.iloc[i,2], df_centroids.iloc[i,3], df_centroids.iloc[i,4], df_centroids.iloc[i,5], df_centroids.iloc[i,6]]
    df_centroids[i,7]=distance.euclidean([0, 0, 0, 0, 0, 0, 0], centroid)
    i+=1
df_centroids

       0            1          2          3             4           5           6      'Euclidean Distance'     (0, 7)      (1, 7)       (2, 7)       (3, 7)      (4, 7)     (5, 7)      (6, 7)      (7, 7)
0   2.443664    -0.158806   -0.403137   0.609063    -0.412371   -0.486611   -0.687598                       2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
1   -0.389052   1.258986    -0.517471   -0.127748   0.379712    -0.486611   -0.143564                       2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
2   -0.215555   0.201088    1.149816    -0.501471   0.275600    -0.088475   1.434132                        2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
3   -0.227075   -0.806379   -0.412111   -0.174150   -0.417327   -0.401676   -0.234962                       2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
4   -0.130615   0.197548    1.282325    -0.940454   0.161774    2.167632    -0.263252                       2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
5   0.015202    -0.125552   -0.665733   1.792274    -0.360096   -0.390093   -0.044649                       2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
6   0.256554    1.422368    1.139299    -0.917565   6.804388    -0.486611   0.726889                        2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439
7   6.010360    0.643581    2.401293    -1.193860   0.068166    1.636784    0.726889                        2.722099    1.556305    1.949607    1.136964    2.716432    1.988787    7.161965    6.851439

As you can see, instead of calculating the Euclidean distance for each row, the code creates 8 new columns and copies the same set of values into every row. Where am I going wrong?

I have searched online for a solution but had no luck so far. I would really appreciate any help.


Solution

  • When working with numpy, you rarely need explicit loops. Highly tuned vector and matrix operations exist for most use cases.

    For your problem, note that the Euclidean distance of a point from the origin is the same as the Euclidean norm of its coordinate vector. numpy.linalg.norm computes exactly that.

    To calculate the Euclidean (L2) norm of one vector:

    import numpy as np
    np.linalg.norm([1, 2, 3])
    # 3.7416573867739413
    

    To calculate the norm for a matrix of row vectors individually for each row (as in your problem):

    np.linalg.norm([[1,2,3],
                    [4,5,6]], axis=1)
    # array([3.74165739, 8.77496439])
    

    To calculate the norm for a matrix of column vectors individually for each column:

    np.linalg.norm([[1, 4],
                    [2, 5],
                    [3, 6]], axis=0)
    # array([3.74165739, 8.77496439])
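
Applied to the DataFrame in the question, this could look like the sketch below. It rebuilds only the first three centroids from the question for brevity, and it assumes the seven coordinate columns are the first seven columns of the frame. The small `demo` frame at the end is hypothetical and only illustrates why the original loop added columns: `df[i, 7] = value` treats the tuple `(i, 7)` as a column label and broadcasts the scalar to every row.

```python
import numpy as np
import pandas as pd

# First three centroids copied from the question; columns 0..6 hold the coordinates.
df_centroids = pd.DataFrame([
    [2.443664, -0.158806, -0.403137, 0.609063, -0.412371, -0.486611, -0.687598],
    [-0.389052, 1.258986, -0.517471, -0.127748, 0.379712, -0.486611, -0.143564],
    [-0.215555, 0.201088, 1.149816, -0.501471, 0.275600, -0.088475, 1.434132],
])

# Take the coordinate columns before adding the result column, so the
# new column can never end up inside the norm.
coords = df_centroids.iloc[:, :7].to_numpy()

# axis=1 yields one norm per row, i.e. each centroid's distance from the origin.
df_centroids['Euclidean Distance'] = np.linalg.norm(coords, axis=1)
print(df_centroids['Euclidean Distance'])

# Hypothetical two-row frame showing what the loop in the question did:
# indexing with [i, 7] passes the tuple (i, 7) as a column label.
demo = pd.DataFrame({'a': [1, 2]})
demo[0, 7] = 9.0
print(list(demo.columns))
```

To set a single cell by position inside a loop, `df_centroids.iloc[i, 7] = value` would be the positional form, although the vectorized assignment above makes the loop unnecessary.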