matlab, cluster-computing, hamming-distance

Why does clustering by hamming distance in MATLAB give centroids in decimal?


    X=[1 0 1 0 0 1 1 1;
       0 0 0 1 1 0 1 0;
       1 1 0 1 0 1 0 1;
       1 0 1 0 1 0 1 0;
       0 0 0 0 1 1 1 0;  
       1 1 1 0 0 0 1 1;
       1 0 1 0 1 1 1 0;
       0 1 0 1 1 0 1 1];

    [IDX,C] = kmeans(X,3, 'distance', 'hamming')

I wanted to test how to cluster binary data using Hamming distance, so in the code above I've filled X with arbitrary binary values. The problem is that my centroids contain decimal values, as shown below.

C=
    1.0000    1.0000    1.0000         0         0    1.0000    1.0000    1.0000
         0    0.5000         0    1.0000    1.0000         0    1.0000    0.5000
    1.0000         0    0.5000         0    1.0000    1.0000    1.0000         0

Why is there a 0.5 in the answer? I want the centroids to be binary too. Also, is it possible to plot the clusters so that the points do not overlap, given that the data is binary?


Solution

  • A centroid is an imaginary point (imaginary in the sense that it need not be one of the data points) that represents the center of its cluster. Think of it as the cluster's "center of mass".

    Centroids very often fall between the points in the cluster, so even when every data point is binary, the centroid coordinates need not be integers. For the 'hamming' distance, the MATLAB documentation describes each centroid as the component-wise median of the points in its cluster, so a 0.5 appears wherever a cluster is split evenly between 0 and 1 in that coordinate.

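    If you want the centroid coordinates to be binary as well, the simplest solution is to apply a rounding function such as round, ceil, floor or fix to C. When the fractional entries are all 0.5, as in your output, the choice only decides how ties are broken: round and ceil map 0.5 to 1, while floor and fix map it to 0.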
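
As a rough illustration of both points, the sketch below takes a hypothetical two-point cluster (rows 2 and 8 of X from the question; which rows kmeans actually grouped together is an assumption), computes its component-wise median, and then rounds the result. The variable names members, c, cUp and cDown are made up for this example. The median happens to reproduce the second centroid row shown in the question.

```matlab
% Hypothetical cluster: rows 2 and 8 of X from the question.
% (Assumption: we do not know which rows kmeans really grouped together.)
members = [0 0 0 1 1 0 1 0;
           0 1 0 1 1 0 1 1];

% Component-wise median along the rows. Where the two points disagree,
% the median of {0, 1} is 0.5, which is where the decimals come from.
c = median(members, 1);

% Force binary centroids. The calls differ only in how 0.5 is resolved.
cUp   = round(c);   % 0.5 -> 1 (ceil gives the same result for these values)
cDown = floor(c);   % 0.5 -> 0 (fix gives the same result for these values)

disp(c);
disp(cUp);
disp(cDown);
```

Either rounding direction is equally good in Hamming terms: in a tied coordinate, half of the cluster members differ from the centroid whichever bit is chosen, so the total distance to the members is the same.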