Tags: python, python-3.x, machine-learning, scipy, k-means

kmeans cluster number does not match the k value


Code based on this article works as expected when I define only 3 clusters. But when I change the number of clusters, I do not get the same number of clusters back.

from matplotlib import image as img
from matplotlib import pyplot as plt
import pandas as pd

image = img.imread("my_logo1.jpg")
image.shape

# One list per color channel
r = []
g = []
b = []

# Walk every pixel and scale each channel from 0-255 to 0-1
for line in image:
    for pixel in line:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r / 255)
        g.append(temp_g / 255)
        b.append(temp_b / 255)

df = pd.DataFrame({"red": r, "green": g, "blue": b})

from scipy.cluster.vq import kmeans
# Ask for 7 clusters (7 dominant colors)
cluster_centers, distortion = kmeans(df[["red", "green", "blue"]], 7)

print(cluster_centers)

Only 3 cluster centers are returned, but I expected 7.

I expected kmeans to return as many colors as the number of clusters I passed to it.


Solution

  • In the source code of kmeans(), you can see that it relies on a supporting function, _kmeans(), which contains this line:

    code_book = code_book[has_members]
    

    has_members is a boolean array, returned by _vq.update_cluster_means(), that marks which clusters have at least one member.

    In short, when you specify the number of clusters k, the algorithm returns the set of at most k centroids with the lowest distortion seen. Clusters that end up empty are simply removed during the update step of k-means, so asking for 7 can give back only 3. This is likely when the image contains fewer distinct colors than k, which is plausible for a logo.
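
The behaviour described above can be reproduced without the original image. This is only a minimal sketch under stated assumptions: the 300-pixel array with exactly 3 distinct colors is synthetic stand-in data (not my_logo1.jpg), and the small code_book / has_members arrays at the end are hypothetical values made up to show the boolean-mask step, not values taken from SciPy internals.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic stand-in for a logo: 300 pixels, but only 3 distinct RGB colors.
palette = np.array([
    [1.0, 0.0, 0.0],  # red
    [0.0, 1.0, 0.0],  # green
    [0.0, 0.0, 1.0],  # blue
])
pixels = np.repeat(palette, 100, axis=0)  # shape (300, 3), float64

# Count the distinct colors actually present in the data.
n_unique = len(np.unique(pixels, axis=0))

# Ask for 7 clusters, as in the question.
cluster_centers, distortion = kmeans(pixels, 7)

print("requested k:       ", 7)
print("distinct colors:   ", n_unique)
print("centroids returned:", len(cluster_centers))  # fewer than 7 here

# The filtering step itself is ordinary boolean-mask indexing.
# Hypothetical values, for illustration only:
code_book = np.arange(12, dtype=float).reshape(4, 3)  # 4 made-up centroids
has_members = np.array([True, False, True, False])    # 2 clusters are empty
filtered = code_book[has_members]                     # keeps rows 0 and 2
print(filtered.shape)
```

With this data the call cannot return more than 3 centroids, because every initial centroid is one of the 3 colors and duplicates end up with no members. Checking np.unique(..., axis=0) on your own pixels before choosing k is one way to see whether the image simply has too few distinct colors.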