I was using agglomerative clustering to cluster a vehicle dataset. I computed the distance matrix in two ways: once with scipy.spatial.distance.euclidean and once with scipy.spatial.distance_matrix.
As far as I understand, both should give the same result, and the printed matrices do look identical. But when I compare the two outputs element-wise, some elements come out as False. Why is this happening?
Steps to reproduce:
!wget -O cars_clus.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv
import pandas as pd

filename = 'cars_clus.csv'
# Read the csv
pdf = pd.read_csv(filename)
# Clean the data
pdf[[ 'sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']] = pdf[['sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
pdf = pdf.dropna()
pdf = pdf.reset_index(drop=True)
# selecting the feature set
featureset = pdf[['engine_s', 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]
# Normalised using minmax
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
# M1 : using scipy.spatial.distance.euclidean
import numpy as np
import scipy.spatial.distance

leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])
for i in range(leng):
    for j in range(leng):
        D[i, j] = scipy.spatial.distance.euclidean(feature_mtx[i], feature_mtx[j])
print(pd.DataFrame(D).head())
# M2 : using scipy.spatial's distance_matrix
from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx, feature_mtx)
print(pd.DataFrame(dist_matrix).head())
As you can see, both results look the same, but when I compare the two matrices I do not get True for every element:
# Comparing
pd.DataFrame(dist_matrix == D).head()
Building on Graipher's answer, you can try this:
comp = np.isclose(dist_matrix, D)
pd.DataFrame(comp).head()
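If you just want a single yes/no verdict for the whole matrix rather than an element-wise mask, numpy also provides allclose. A minimal sketch, assuming dist_matrix and D are the two matrices computed above:

# True if every pair of entries agrees within the default tolerances
# (rtol=1e-05, atol=1e-08)
np.allclose(dist_matrix, D)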
Now, coming to why this happens: it's caused by the internal representation of floating point numbers, which uses a fixed number of binary digits to represent a decimal number. Some decimal numbers can't be represented exactly in binary, resulting in small round-off errors. People are often very surprised by results like this:
>>> 1.2-1.0
0.19999999999999996
This is not an error. Floating point numbers only have 32 or 64 bits of precision, so the digits are cut off at some point, and the two distance routines can accumulate rounding slightly differently, leaving tiny discrepancies in the last few decimal places.
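To see how this plays out with your matrices, you can inspect one position where the strict == comparison failed. A minimal sketch, assuming D and dist_matrix from your code above (the exact indices that differ will depend on the data):

import numpy as np

# Find the first position where the strict == comparison fails
i, j = np.argwhere(D != dist_matrix)[0]

# Both values are "the same" distance, but the last bits differ
print(D[i, j], dist_matrix[i, j])
print(D[i, j] - dist_matrix[i, j])   # a tiny difference, typically around 1e-16 here

That difference is far below anything that matters for clustering, which is why comparing with np.isclose (or np.allclose) is the right test rather than ==.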