Tags: python, matlab, cluster-analysis, k-means

K-means with the same initial centroids gives different outputs in Matlab and Python environments


The input to k-means in both the Matlab and Python environments is the following list:

input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64]

Matlab:

[idx, C] = kmeans(input',3,'Start',[0.3;0.9;1.5]);

Output

C = [0.596, 0.825, 1.035]

(idx==1) = 15, (idx==2) = 6, (idx==3) = 6

Python:

import numpy as np
from sklearn.cluster import KMeans

init = np.array([0.3, 0.9, 1.5]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=1, init=init).fit(np.array(input).reshape(-1, 1))
idx = kmeans.labels_
C = kmeans.cluster_centers_

Output

C = [0.430, 0.969, 0.637]

(idx==0) = 2, (idx==1) = 10, (idx==2) = 15

Clearly, the output centroids and the number of input points assigned to each of the 3 clusters differ between the two environments. Why does this happen even though the initial centroids are the same?


Solution

  • I've written a minimal k-means algorithm in Matlab to test your dataset:

    input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 
             0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 
             0.64];
    
    c     = [0.3;0.9;1.5]
    
    for ii = 1:10
        [~,idx] = min(abs(c-input));         % index of the nearest centroid (pairwise Euclidean distance)
        c = accumarray(idx.',input,[],@mean) % compute the new centroids
    end
    

    After the first iteration the index idx, which indicates the closest centroid for each value, looks like this:

     2   2   2   2   2   2   2   1   1   2...
    

    The last centroid (1.5 here) is NEVER the closest one to any value! So, in order to keep 3 clusters, the k-means algorithm has to compute, somehow, a new value for this centroid (because the mean of an empty set is undefined). And it looks like Python and Matlab have different implementations for handling this case.
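You can check the empty-cluster effect with a single assignment step in NumPy (a minimal sketch mirroring the Matlab snippet above; the variable names are mine):

```python
import numpy as np

# Same dataset and initial centroids as in the question.
data = np.array([1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42,
                 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95,
                 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64])
centroids = np.array([0.3, 0.9, 1.5])

# Assignment step: index of the nearest centroid for each point.
idx = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)

# How many points each centroid attracts after the first assignment.
counts = np.bincount(idx, minlength=3)
print(counts)  # the third centroid (1.5) attracts zero points
```

The largest value in the dataset is 1.16, which is closer to 0.9 than to 1.5, so no point is ever assigned to the third centroid.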

    If you want to avoid this problem, make sure that every initial centroid is the closest centroid to at least one element of your dataset.

    You can, for example, take the first three distinct values of your dataset.
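As a sketch of that fix with scikit-learn (variable names are mine): seed the run with the first three distinct values of the dataset, so each initial centroid is trivially the nearest one to at least one point (itself). On this dataset the run should then converge to the same centroids Matlab reports.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42,
                 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95,
                 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64])

# First three distinct values in order of appearance: 1.11, 0.81, 0.61.
init = np.array(list(dict.fromkeys(data))[:3]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=1, init=init).fit(data.reshape(-1, 1))
centers = np.sort(kmeans.cluster_centers_.ravel())
print(centers)  # approximately [0.596, 0.825, 1.035], the Matlab centroids
```

With these starting points every cluster stays non-empty throughout the iterations, so the ambiguity about how to refill an empty cluster never arises.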