scipy.spatial.distance.euclidean and scipy.spatial.distance_matrix not returning the same result

I was using agglomerative clustering to cluster a vehicle dataset. I calculated the distance matrix in two ways: once with scipy.spatial.distance.euclidean and once with scipy.spatial.distance_matrix.

As I understand it, both approaches should give the same result. The values look identical, but when I compare the two outputs element by element, some comparisons return False. Why is this happening?

Steps to reproduce:

!wget -O cars_clus.csv
filename = 'cars_clus.csv'

# Read csv
import pandas as pd
pdf = pd.read_csv(filename)

# Clean the data: coerce these columns to numeric (invalid entries become NaN)
numeric_cols = ['sales', 'resale', 'type', 'price', 'engine_s',
       'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
       'mpg', 'lnsales']
pdf[numeric_cols] = pdf[numeric_cols].apply(pd.to_numeric, errors='coerce')
pdf = pdf.dropna()
pdf = pdf.reset_index(drop=True)

# selecting the feature set
featureset = pdf[['engine_s',  'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]

# Normalised using minmax
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)

Calculate the distance matrix.

#M1 : Using scipy's euclidean

import numpy as np
from scipy.spatial.distance import euclidean
leng = feature_mtx.shape[0]
D = np.zeros([leng,leng])  # np.zeros replaces the deprecated scipy.zeros alias
for i in range(leng):
    for j in range(leng):
        D[i,j] = euclidean(feature_mtx[i], feature_mtx[j])

# M2 : using scipy.spatial's distance_matrix

from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx, feature_mtx)

Even though both results look the same, when I compare the two matrices I do not get True for every element.

# Comparing

pd.DataFrame(dist_matrix == D).head()


  • Building on Graipher's answer, you can try this:

    comp = np.isclose(dist_matrix, D)

    Now, as to why this was happening: it's not an error. It's a problem caused by the internal representation of floating point numbers, which uses a fixed number of binary digits to represent a decimal number. Some decimal numbers can't be represented exactly in binary, resulting in small roundoff errors. People are often very surprised by results like this:

    >>> 1.2-1.0
    0.19999999999999996

    Floating point numbers are stored in only 32 or 64 bits, so the digits are cut off at some point. The two functions do not necessarily perform the arithmetic in the same order, so their results can differ in the last few bits, and an exact == comparison then returns False for those elements.
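
    The explanation above can be checked with a small sketch. It does not use the cars dataset: the random `points` array is only a stand-in for the scaled feature matrix, and every variable name here is made up for illustration. It compares the two methods with a tolerance instead of `==` and reports the largest difference between them.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.spatial.distance import euclidean

# Scalar case: 1.2 has no exact binary representation, so the
# subtraction does not land exactly on the float closest to 0.2.
diff = 1.2 - 1.0
exact_equal = (diff == 0.2)            # exact comparison
close_enough = np.isclose(diff, 0.2)   # comparison within a tolerance

# Matrix case: random stand-in for the scaled feature matrix (assumption).
rng = np.random.default_rng(0)
points = rng.random((20, 8))

# Method 1: pairwise loop with scipy.spatial.distance.euclidean
n = points.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = euclidean(points[i], points[j])

# Method 2: scipy.spatial.distance_matrix
dist_matrix = distance_matrix(points, points)

# Largest element-wise disagreement, and a tolerance-based comparison
max_abs_diff = np.abs(dist_matrix - D).max()
all_close = np.allclose(dist_matrix, D)

print(exact_equal, close_enough)
print(max_abs_diff, all_close)
```

    If `max_abs_diff` is zero on your machine, the two methods happened to round identically for this input. `np.allclose` is the safer check either way, and its `rtol` and `atol` parameters can be tightened if the defaults are too loose for your data.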