My dataset can be found on Kaggle: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python. I'm running k-means with k = 5 on my dataset, which has 4 columns and 200 rows. I wanted to find the cluster radius, so I measured the average distance of each data point from its respective cluster center, but whenever I re-run my program the values change. My cluster centers don't change between runs, so what's going on exactly? How do I fix this?
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler
import numpy as np
import scipy.spatial.distance as sdist
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols=['Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=5, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
centroids = kmeans.cluster_centers_
print(centroids)
df["cluster"] = kmeans.labels_
n_clusters = 5
clusters = [x[y_kmeans == i] for i in range(n_clusters)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} observations: {}...'.format(i, len(c), c[0]))
df["cluster"] = kmeans.labels_
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
#cluster radius
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)
t_data = PCA(n_components=2).fit_transform(x)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = kmeans.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
print("mean distances are", c_mean_distances)
Output 1 [1.5381892556224435, 1.796763983963032, 1.5144402423920744, 3.4372440532366753, 1.6533031213582314]
Iteration 2 [3.180393284279158, 2.809194267986748, 0.7823704675079582, 3.4929008204149365, 1.8109097594336663]
Iteration 3 [1.9461073260609538, 3.2032294269352155, 2.447917517713439, 3.4372440532366753, 2.197239028470577]
I'll add the answer to document the issue.
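First, when you use a lower-dimensional embedding, check whether it needs a random seed to be repeatable. In this case (PCA) I think it is fine: scikit-learn's PCA is deterministic with the full SVD solver, and only its randomized and ARPACK solvers make use of random_state. Other lower-dimensional embeddings may vary from run to run.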
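As a rough check of the point about PCA, here is a minimal sketch. The data is a synthetic placeholder (not the mall dataset), and the explicit svd_solver values are chosen only for illustration. It fits PCA twice with the full solver, and twice with the randomized solver pinned to a fixed random_state, then compares the outputs.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic placeholder data; stands in for any scaled feature matrix
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

# Full SVD solver: no random numbers are involved
full_a = PCA(n_components=2, svd_solver="full").fit_transform(data)
full_b = PCA(n_components=2, svd_solver="full").fit_transform(data)

# Randomized solver: pin random_state so repeated fits agree
rand_a = PCA(n_components=2, svd_solver="randomized", random_state=0).fit_transform(data)
rand_b = PCA(n_components=2, svd_solver="randomized", random_state=0).fit_transform(data)

print(np.allclose(full_a, full_b), np.allclose(rand_a, rand_b))
```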
Second, KMeans does not always converge to a global optimum, so different random initializations can converge to different clusters. To keep KMeans repeatable, scikit-learn provides the random_state input parameter.
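You set this the first time you ran KMeans, which kept the first portion of your code repeatable. The second KMeans, fitted after the PCA embedding, is created with no arguments, so it is initialized differently on every run (and it uses the default n_clusters=8 rather than 5). To ensure repeatability of the clustering after the PCA embedding, set the random state in the same way: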
k_means = KMeans(n_clusters=5, max_iter=100, random_state=0)
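To see the effect end to end, here is a minimal sketch of the radius calculation with the seed fixed. It runs on synthetic stand-in data rather than the mall CSV, and the helper name mean_distances is hypothetical. It also reads the centroids from the same model that produced the labels. The question's code takes them from the first model (kmeans) instead, which may not be what was intended.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def mean_distances(data, seed=0, n_clusters=5):
    """Mean Euclidean distance from each cluster's points to its own centroid."""
    t_data = PCA(n_components=2).fit_transform(data)
    # n_init is set explicitly because its default differs between scikit-learn versions
    model = KMeans(n_clusters=n_clusters, max_iter=100, n_init=10, random_state=seed)
    labels = model.fit_predict(t_data)
    # Centroids come from the same model that assigned the labels
    centroids = model.cluster_centers_
    return [
        float(np.mean(np.linalg.norm(t_data[labels == i] - centroids[i], axis=1)))
        for i in range(n_clusters)
    ]


# Synthetic stand-in for the scaled two-column mall data (an assumption)
rng = np.random.default_rng(42)
x = StandardScaler().fit_transform(rng.normal(size=(200, 2)))

run_1 = mean_distances(x)
run_2 = mean_distances(x)
print(np.allclose(run_1, run_2))
```

With the seed fixed, repeated calls should give matching mean distances. Dropping random_state from the KMeans call brings back the run-to-run variation described in the question.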