The input to k-means in both the MATLAB and Python environments is the following list:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64]
Matlab:
[idx, C] = kmeans(input',3,'Start',[0.3;0.9;1.5]);
Output
C = [0.596, 0.825, 1.035]
(idx==1) = 15, (idx==2) = 6, (idx==3) = 6
Python:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=1, init=np.array([0.3,0.9,1.5]).reshape(-1,1)).fit(np.array(input).reshape(-1, 1))
idx = kmeans.labels_
C = kmeans.cluster_centers_
Output
C = [0.430, 0.969, 0.637]
(idx==0) = 2, (idx==1) = 10, (idx==2) = 15
Clearly, the output centroids and the number of points assigned to each of the 3 clusters differ between the two environments. Why does this happen even though the initial centroids are the same?
I wrote a minimal k-means algorithm to test your dataset in MATLAB:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59,
0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90,
0.64];
c = [0.3;0.9;1.5]
for ii = 1:10
[~,idx] = min(abs(c-input)); % index of the closest centroid for each value (1-D Euclidean distance)
c = accumarray(idx.',input,[],@mean) % compute the new centroid
end
After the first iteration, the index idx, which indicates the closest centroid for each value, looks like this:
2 2 2 2 2 2 2 1 1 2...
The last centroid (1.5 here) is NEVER the closest one to any value! So, in order to keep 3 groups, the k-means algorithm has to compute a new value for this centroid somehow (because the mean of an empty set is undefined). It looks like Python and MATLAB implement this step differently.
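The same check can be reproduced on the Python side. The following is only a minimal sketch of the first assignment step, written in plain Python rather than with scikit-learn; the names data, centroids and assign are made up for this illustration.

```python
# Data and starting centroids copied from the question.
data = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74,
        0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62,
        0.62, 0.64, 0.98, 0.90, 0.64]
centroids = [0.3, 0.9, 1.5]


def assign(points, centers):
    """Return, for each point, the 0-based index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points]


labels = assign(data, centroids)
# Number of points owned by each centroid after the first assignment.
counts = [labels.count(k) for k in range(len(centroids))]
print(counts)  # the last entry is 0: no value is closest to 1.5
```

The largest value in the dataset is 1.16, which is 0.26 away from 0.9 but 0.34 away from 1.5, so the third cluster starts out empty whatever the library.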
If you want to avoid this problem, make sure that every initial centroid is the closest one for at least one element of your dataset.
You can, for example, take the first three distinct values of your dataset.
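As a rough illustration of that suggestion, here is a small plain-Python sketch that picks the first three distinct values and checks that each of them owns at least one point. The helper names first_distinct and owned_counts are hypothetical, not part of any library.

```python
data = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74,
        0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62,
        0.62, 0.64, 0.98, 0.90, 0.64]


def first_distinct(points, k):
    """Return the first k distinct values of points, in order of appearance.

    Returns fewer than k values if the data has fewer distinct values.
    """
    seen = []
    for p in points:
        if p not in seen:
            seen.append(p)
        if len(seen) == k:
            break
    return seen


def owned_counts(points, centers):
    """Count how many points are nearest to each center."""
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        counts[nearest] += 1
    return counts


init = first_distinct(data, 3)
print(init)  # [1.11, 0.81, 0.61]
# Each starting centroid is itself a data point, so no cluster starts empty.
print(all(c > 0 for c in owned_counts(data, init)))  # True
```

The resulting list can then be used as the starting centroids in either environment, for instance reshaped to a column with np.array(init).reshape(-1, 1) as in the question.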