python, numpy, k-means

Don't understand these RuntimeWarnings during k-means clustering vector quantization


I am trying to implement a K-Means clustering algorithm; however, more often than not I get the following warnings:

C:\Users\andre\AppData\Roaming\Python\Python37\site-packages\numpy\core\fromnumeric.py:3257: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)

C:\Users\andre\AppData\Roaming\Python\Python37\site-packages\numpy\core\_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

I traced the problem to the part of my code that tries to find the new centroid by taking the average value: 'points' ends up being an empty array, which causes me to get stuck in my while loop. I can't understand why.

import numpy as np
from copy import deepcopy

def compute_euclidean_distance(vec1, vec2, ax):
    return np.linalg.norm(vec1 - vec2, axis=ax)

def initalise_centroids(dataset, k):
    # Pick k random (x, y) centroids within the dataset's value range
    rand_x = np.random.randint(np.min(dataset), np.max(dataset), size=k)
    rand_y = np.random.randint(np.min(dataset), np.max(dataset), size=k)
    centroids = np.array(list(zip(rand_x, rand_y)), dtype=np.float32)
    return centroids

def kmeans(dataset, k):
    cent = initalise_centroids(dataset, k)
    cOld = np.zeros(cent.shape)
    clusters = np.zeros(len(dataset))
    err = compute_euclidean_distance(cent, cOld, None)
    count = 0

    while err != 0:

        # Assignment step: put each point in the cluster of its nearest centroid
        for i in range(len(dataset)):
            dist = compute_euclidean_distance(dataset[i], cent, 1)
            cluster = np.argmin(dist)
            clusters[i] = cluster

        cOld = deepcopy(cent)

        # Update step: move each centroid to the mean of its assigned points
        for i in range(k):
            points = [dataset[j] for j in range(len(dataset)) if clusters[j] == i]
            cent[i] = np.mean(points, axis=0)

        err = compute_euclidean_distance(cent, cOld, None)
        print(err)
        count += 1

    return cent, clusters, err
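
For reference, the same warnings can be reproduced in isolation; this is simply NumPy's behaviour when asked for the mean of an empty selection, independent of the clustering code:

import numpy as np

# `points` becomes an empty list whenever a cluster has no members;
# the mean of an empty selection is NaN and raises "Mean of empty slice".
points = []
centroid = np.mean(points, axis=0)   # RuntimeWarning: Mean of empty slice
print(centroid)                      # nan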

Solution

  • I noticed two things:

    1. The exit condition of your loop, while err != 0, will most likely never be met: floating-point centroids rarely stop moving exactly, so err almost never becomes exactly 0. Usually you set an error threshold and exit the loop once the actual error falls below that value. In sklearn's KMeans documentation, you can see this as the tol parameter (see the sketch at the end of this answer).
    2. Your second for loop assumes that each cluster will have some points assigned to it. This may not be the case. For example, I ran your code with the input [(1,100),(1,100),(100,100)] and k = 2. You would expect the algorithm to converge to two clusters: one containing the first two points and one containing the last point.
      But when the algorithm initialized its random cluster centers, it picked [[29,78],[62,25]]. In that case, all of my points got assigned to cluster 0 at first.
      So, when your second loop went through all the values of range(k), there were no points for cluster 1, which is why you see the nan values (and the warnings) in your output.

    You may want to look at other cluster-center initialization algorithms like k-means++; a minimal sketch combining both fixes is shown below.

    Hope that helps!
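
    Here is a minimal sketch of how both fixes could look, keeping the overall structure of your code and assuming dataset is a 2-D NumPy array. The tolerance value, the data-point seeding, and the re-seed-on-empty-cluster strategy are illustrative choices, not the only options:

    import numpy as np

    def kmeans(dataset, k, tol=1e-4, max_iter=100):
        # Seed centroids from actual data points instead of random integers,
        # which makes empty clusters much less likely (k-means++ is better still).
        rng = np.random.default_rng()
        cent = dataset[rng.choice(len(dataset), size=k, replace=False)].astype(np.float64)
        clusters = np.zeros(len(dataset), dtype=int)
        err = np.inf

        for _ in range(max_iter):
            # Assignment step: nearest centroid for every point.
            dist = np.linalg.norm(dataset[:, None, :] - cent[None, :, :], axis=2)
            clusters = np.argmin(dist, axis=1)

            c_old = cent.copy()
            for i in range(k):
                members = dataset[clusters == i]
                if len(members) == 0:
                    # Empty cluster: re-seed it from a random data point instead of
                    # taking the mean of nothing (the source of the warnings).
                    cent[i] = dataset[rng.integers(len(dataset))]
                else:
                    cent[i] = members.mean(axis=0)

            # Stop once the centroids have (almost) stopped moving,
            # playing the same role as sklearn's tol parameter.
            err = np.linalg.norm(cent - c_old)
            if err <= tol:
                break

        return cent, clusters, err

    Re-seeding from an existing data point is only one way to handle an empty cluster; other common strategies are to drop the cluster or to restart with a better initialization such as k-means++, which is what sklearn's KMeans uses by default.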