I am currently following IBM's introductory machine learning course. After the instructor finished building the model, I noticed that he does not use the normalized data to fit it; he trains on the raw data and still ends up with good, non-overlapping clusters. But when I trained the model on the normalized data instead, the result was much worse and the clusters came out nested/overlapping, as the code and image below show. Why did normalization lead to that? As far as I know, normalization is usually recommended for distance-based algorithms like k-means.
Code that does not use the normalized data:
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv(r'D:\machine learning\Cust_Segmentation.csv')  # raw string so the backslashes are not treated as escapes
cust_df.head()
df = cust_df.drop('Address', axis=1)  # drop the non-numeric Address column
X = df.values[:, 1:]                  # skip the first (id) column
X = np.nan_to_num(X)                  # replace NaNs with 0
from sklearn.preprocessing import StandardScaler
norm_featur = StandardScaler().fit_transform(X)  # computed here but never used below
clusterNum = 3
kmeans = KMeans(init='k-means++', n_clusters=clusterNum, n_init=12)
kmeans.fit(X)  # fitted on the raw, unscaled data
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * (X[:, 1]) ** 2  # marker sizes from the second feature column
plt.scatter(X[:, 0], X[:, 3], s=area, c=kmeans.labels_.astype(float), alpha=0.5)  # np.float is removed in recent NumPy
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
(Image: clusters without using normalization)
Code using the normalized data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv(r'D:\machine learning\Cust_Segmentation.csv')
cust_df.head()
df = cust_df.drop('Address', axis = 1)
X = df.values[:, 1:]
X = np.nan_to_num(X)
from sklearn.preprocessing import StandardScaler
norm_feature = StandardScaler().fit_transform(X)
clusterNum = 3
kmeans = KMeans(init = 'k-means++', n_clusters = clusterNum, n_init = 12)
kmeans.fit(norm_feature)  # fitted on the standardized data this time
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * (norm_feature[:, 1]) ** 2
plt.scatter(norm_feature[:, 0], norm_feature[:, 3], s=area, c=kmeans.labels_.astype(float), alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
Income and age are on fairly different scales here. In your first plot, a difference of ~100 in income spans about the same distance on screen as a difference of ~10 in age. But to k-means, that income difference is 10x larger than the age difference, so the vertical (income) axis easily dominates the clustering.
This is probably 'wrong', unless you happen to believe that a change of 1 in income is 'the same as' a change of 1 in age for the purpose of deciding what is similar. This is why you standardize, which makes a different assumption: that the features are equally important.
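To make this concrete, here is a minimal sketch with made-up numbers (the ages and incomes below are hypothetical, not taken from Cust_Segmentation) showing how the raw income scale dominates the Euclidean distance that k-means uses, and how standardizing changes that:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical customers as [age, income in thousands]: ages differ by 30, incomes by 5.
a = np.array([25.0, 40.0])
b = np.array([55.0, 45.0])
print(np.linalg.norm(a - b))  # sqrt(30**2 + 5**2) ~= 30.4, mostly driven by age

# Same customers, but with income in raw dollars: now income swamps age completely.
a_dollars = np.array([25.0, 40000.0])
b_dollars = np.array([55.0, 45000.0])
print(np.linalg.norm(a_dollars - b_dollars))  # ~= 5000.1, age barely matters

# Standardizing puts every column on a unit-variance scale, so a difference counts
# according to how large it is relative to that feature's spread, not according to
# the units it happens to be measured in.
X_toy = np.array([[25.0, 40000.0],
                  [55.0, 45000.0],
                  [35.0, 60000.0],
                  [45.0, 30000.0]])
X_toy_std = StandardScaler().fit_transform(X_toy)
print(np.linalg.norm(X_toy_std[0] - X_toy_std[1]))  # neither column dominates merely because of its units anymore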
Your second plot doesn't quite make sense; k-means can't produce 'overlapping' clusters, because each point is assigned to its single nearest centroid. The problem is that you have only plotted 2 of the 4 (?) dimensions you clustered on. You can't plot 4D data directly, but I suspect that if you first applied PCA to reduce the result to 2 dimensions and plotted that, you'd see well-separated clusters.
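Here is a rough sketch of that idea, assuming the norm_feature array and the fitted kmeans object from your second snippet are still in scope (the PCA step and the variable names are my own addition, not part of the course code):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the standardized features onto their first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(norm_feature)

# Color each point by the cluster label found on the full standardized feature space.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans.labels_.astype(float), alpha=0.5)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()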