Code based on this article works as expected when I ask for only 3 clusters. But when I change the number of clusters, I do not get the same number of cluster centers back.
from matplotlib import image as img
from matplotlib import pyplot as plt
import pandas as pd
image = img.imread("my_logo1.jpg")
image.shape
r = []
g = []
b = []
for line in image:
    for pixel in line:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r / 255)
        g.append(temp_g / 255)
        b.append(temp_b / 255)
df = pd.DataFrame({"red": r, "green": g, "blue": b})
from scipy.cluster.vq import kmeans
cluster_centers, distortion = kmeans(df[["red", "green", "blue"]], 7)
print(cluster_centers)
Only 3 cluster centers are returned, but I expected 7.
I expected kmeans to return as many colors as the number of clusters I passed to it.
Reading the source code for the kmeans() function, you can see that it uses a supporting function, _kmeans(), which contains:
code_book = code_book[has_members]
has_members is a boolean array indicating which clusters have members, resulting from _vq.update_cluster_means().
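The effect of that mask can be imitated with plain NumPy. This is only a sketch of the idea, not SciPy's actual implementation: the centroid values and the `labels` assignment are made up for illustration, and `np.bincount` stands in for the membership check.

```python
import numpy as np

# Four hypothetical centroids (RGB values scaled to 0..1).
code_book = np.array([
    [0.1, 0.1, 0.1],
    [0.5, 0.5, 0.5],
    [0.9, 0.9, 0.9],
    [0.3, 0.3, 0.3],
])

# Hypothetical assignment step: index of the nearest centroid for each of
# five observations. Centroids 1 and 3 were never chosen.
labels = np.array([0, 0, 2, 2, 2])

# True where a centroid has at least one observation assigned to it.
has_members = np.bincount(labels, minlength=len(code_book)) > 0

# Same indexing as in _kmeans(): rows for empty clusters are dropped.
code_book = code_book[has_members]
print(has_members)
print(code_book.shape)
```

Once a row is dropped this way it is not added back, so the code book can only shrink from one iteration to the next.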
In short, when you specify the number of clusters k, the algorithm returns a set of centroids (at most k) with the lowest distortion seen. Empty clusters are simply removed during the update step of K-means.
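A small reproduction may make this easier to see. The sketch below uses synthetic data instead of the image from the question: the assumption is that a logo with a few flat colors behaves much like an array with only 3 distinct RGB rows. It also calls `scipy.cluster.vq.kmeans2`, which keeps the previous centroid for an empty cluster (it warns instead of dropping it when `missing='warn'`), so it always returns k rows. Whether k rows with duplicates are useful for your case is a separate question.

```python
import numpy as np
from scipy.cluster.vq import kmeans, kmeans2

# Synthetic "image": 300 pixels, but only 3 distinct colors (assumed data).
colors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
pixels = np.repeat(colors, 100, axis=0)

# How many distinct colors are there to cluster in the first place?
n_unique = len(np.unique(pixels, axis=0))
print("distinct colors:", n_unique)

# kmeans() drops empty clusters, so asking for 7 cannot yield more
# centroids than there are distinct points.
centers, distortion = kmeans(pixels, 7)
print("kmeans returned", len(centers), "centers")

# kmeans2() keeps empty clusters; it may emit a UserWarning about them.
centers2, labels2 = kmeans2(pixels, 7, minit="points", missing="warn")
print("kmeans2 returned", len(centers2), "centers")
```

Running the `np.unique` check on the real pixel data is a quick way to tell whether 7 clusters are even possible for a given image.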