Tags: python, scipy, hierarchical-clustering, euclidean-distance, distance-matrix

scipy.spatial.distance.euclidean and scipy.spatial.distance_matrix not returning the same result


I was using an agglomerative clustering technique to cluster a vehicle dataset. I used two methods to calculate the distance matrix: one using scipy.spatial.distance.euclidean and the other using scipy.spatial.distance_matrix.

As far as I understand, both should give the same result, and the printed outputs do look identical. However, when I compare the output of the two methods element-wise, some entries come out as False. Why is this happening?

Steps to reproduce:

!wget -O cars_clus.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv
filename = 'cars_clus.csv'

# Read the csv into a dataframe
import pandas as pd
pdf = pd.read_csv(filename)

# Clean the data
pdf[[ 'sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']] = pdf[['sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
pdf = pdf.dropna()
pdf = pdf.reset_index(drop=True)

# selecting the feature set
featureset = pdf[['engine_s',  'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]

# Normalised using minmax
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
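(Optional) A quick sanity check, not in the original post, is to confirm that every scaled column now spans the [0, 1] range that MinMaxScaler is supposed to produce:

# Each column of the scaled matrix should run from 0 to 1
print(feature_mtx.min(axis=0))  # expected: all zeros
print(feature_mtx.max(axis=0))  # expected: all ones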

Calculate the distance matrix:

# M1 : Using scipy's euclidean

import numpy as np
import scipy.spatial

leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])  # pairwise distance matrix
for i in range(leng):
    for j in range(leng):
        D[i, j] = scipy.spatial.distance.euclidean(feature_mtx[i], feature_mtx[j])
print(pd.DataFrame(D).head())
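For reference, the same matrix can also be built without the explicit double loop: scipy.spatial.distance.cdist computes all pairwise Euclidean distances in a single vectorized call (a sketch equivalent to M1; D_fast is just an illustrative name):

from scipy.spatial.distance import cdist

# Vectorized version of the nested loop above
D_fast = cdist(feature_mtx, feature_mtx, metric='euclidean')
print(pd.DataFrame(D_fast).head())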


# M2 : using scipy.spatial's distance_matrix

from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx, feature_mtx)
print(pd.DataFrame(dist_matrix).head())


Even though the two printed results look the same, when I compare the matrices element-wise I do not get True for every entry.

# Comparing

pd.DataFrame(dist_matrix == D).head()
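To quantify how different the two matrices really are, an extra check (not in the original post) is the largest absolute gap between them, which should come out roughly on the order of 1e-16:

# Largest element-wise difference between the two matrices
print(np.abs(D - dist_matrix).max())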



Solution

  • Building on Graipher's answer, you can try this:

    comp = np.isclose(dist_matrix, D)
    pd.DataFrame(comp).head()
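
    If you just want a single True/False for the whole matrix instead of an element-wise mask, np.allclose performs the same tolerance-based comparison in one call:

    # True if every entry agrees within numpy's default tolerances
    # (rtol=1e-05, atol=1e-08)
    np.allclose(dist_matrix, D)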
    

    Now, coming to your question of why this happens: it is caused by the internal representation of floating point numbers, which uses a fixed number of binary digits to represent a decimal number. Some decimal numbers can't be represented exactly in binary, resulting in small round-off errors. People are often surprised by results like this:

    >>> 1.2 - 1.0
    0.19999999999999996
    

    It’s not an error; it’s simply the round-off described above.
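
    You can see the full stored value by asking for more digits than the default repr shows (a small illustration using Python's standard string formatting):

    >>> format(1.2, '.20f')
    '1.19999999999999995559'
    >>> format(1.2 - 1.0, '.20f')
    '0.19999999999999995559'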

    Floating point numbers are stored in only 32 or 64 bits, so the digits get cut off at some point.
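
    As a rough illustration of how much precision is available, numpy exposes the machine epsilon (the gap between 1.0 and the next representable float) for each width:

    import numpy as np

    print(np.finfo(np.float32).eps)   # ~1.19e-07 for 32-bit floats
    print(np.finfo(np.float64).eps)   # ~2.22e-16 for 64-bit floats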