I am trying to implement a K-Means clustering algorithm; however, more often than not I get the following error:
C:\Users\andre\AppData\Roaming\Python\Python37\site-packages\numpy\core\fromnumeric.py:3257:
RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
C:\Users\andre\AppData\Roaming\Python\Python37\site-packages\numpy\core\_methods.py:161:
RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
I traced the problem to the part of my code that tries to find the new centroid by taking the average value. 'points' turns out to be an empty array, which causes me to get stuck in my while loop. I can't understand why.
import numpy as np
from copy import deepcopy

def compute_euclidean_distance(vec1, vec2, ax):
    return np.linalg.norm(vec1 - vec2, axis=ax)

def initalise_centroids(dataset, k):
    rand_x = np.random.randint(np.min(dataset), np.max(dataset), size=k)
    rand_y = np.random.randint(np.min(dataset), np.max(dataset), size=k)
    centroids = np.array(list(zip(rand_x, rand_y)), dtype=np.float32)
    return centroids

def kmeans(dataset, k):
    err = 0
    cent = initalise_centroids(dataset, k)
    cOld = np.zeros(cent.shape)
    clusters = np.zeros(len(dataset))
    err = compute_euclidean_distance(cent, cOld, None)
    count = 0

    while err != 0:
        for i in range(len(dataset)):
            dist = compute_euclidean_distance(dataset[i], cent, 1)
            cluster = np.argmin(dist)
            clusters[i] = cluster
        cOld = deepcopy(cent)
        for i in range(k):
            points = [dataset[j] for j in range(len(dataset)) if clusters[j] == i]
            cent[i] = np.mean(points, axis=0)
        err = compute_euclidean_distance(cent, cOld, None)
        print(err)
        count += 1
    return cent, clusters, err
I noticed two things:

First, the exit condition of while err != 0 will most likely never be reached, since the centroid shift rarely becomes exactly zero in floating point. Usually you set an error threshold such that when the actual error drops below that value, the loop exits. In sklearn's KMeans documentation you can see this in the tol parameter; a sketch of that kind of stopping condition follows below.
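Something along these lines would do it (a minimal sketch; has_converged and tol are illustrative names mirroring the idea behind sklearn's tol parameter, not its implementation):

import numpy as np

def has_converged(cent, cent_old, tol=1e-4):
    # Stop once the total centroid movement drops below the threshold,
    # instead of requiring it to be exactly zero.
    shift = np.linalg.norm(cent - cent_old)
    return shift < tol

In your loop you would then write something like while not has_converged(cent, cOld): instead of while err != 0:.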
Second, the for loop over range(k) assumes that each cluster will have some points assigned to it. This may not be the case. For example, I ran your code with the input [(1,100),(1,100),(100,100)] and k=2. You would think that the algorithm would converge into two clusters: the first two points, and the last point. But the centroids were initialised to [[29,78],[62,25]], and in this case all of my points got assigned to cluster 0 at first. So in the range(k) loop there were no points for cluster 1, which is why you might see the nan values in your output. You may want to look at other cluster centre initialization algorithms like k-means++; see the sketch after this paragraph.
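Here is a rough sketch of both ideas: picking the initial centroids from actual data points (not k-means++, just a simpler data-based seeding) and leaving a centroid unchanged when its cluster is empty. It assumes dataset is a 2-D NumPy array; the helper names are made up for illustration:

import numpy as np

def init_centroids_from_data(dataset, k):
    # Sample k distinct rows of the dataset as starting centroids,
    # so every centroid starts on top of a real point.
    idx = np.random.choice(len(dataset), size=k, replace=False)
    return dataset[idx].astype(np.float32)

def update_centroids(dataset, clusters, cent, k):
    new_cent = cent.copy()
    for i in range(k):
        points = dataset[clusters == i]
        if len(points) > 0:
            # Only recompute the mean for non-empty clusters.
            new_cent[i] = points.mean(axis=0)
        # Otherwise keep the old centroid instead of producing nan.
    return new_cent

For a proper k-means++ seeding you can also just let sklearn handle it, e.g. KMeans(n_clusters=k, init='k-means++').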
Hope that helps!