matlab cluster-analysis k-means nmi

Why is the NMI value small when the clustering accuracy and Rand index are high?


I am using https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement to evaluate my clustering accuracy in MATLAB. It provides accuracy and rand_index, and those results look normal, as expected. However, when I use NMI as a metric, the clustering performance is extremely low. For NMI I am using this source code: https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information.

I have two Nx1 vectors as inputs: one holds the actual labels and the other holds the cluster assignments. I checked every element of both vectors and found that even with 82% accuracy and a Rand index of 0.70, the NMI is only 0.3209. Below is an example on the Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris) using MATLAB's built-in k-means.

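% iris is assumed to be a numeric matrix with the class label in the last column;
% data_dim (number of feature columns to cluster on) is assumed to be defined already
data = iris(:,1:data_dim);
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);

NMI = nmi(actual_label,result_label);
% the label vectors are passed to AccMeasure as row vectors, hence the transposes
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');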

The result:

Auto ACC: 0.820000 Rand_Index: 0.701818 NMI: 0.320912


Solution

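  • The Rand index is not corrected for chance. Every pair of points that both partitions place in different clusters counts as an agreement, so even random clusterings score well above 0, and the score drifts towards 1 as the number of clusters grows. You should therefore not expect to see small Rand values.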

    At the same time, accuracy can be high when most of your points fall into the same large cluster.

    I have a feeling that the NMI is producing a more reliable comparison. To verify, try running a dimensionality reduction and plotting the data points colored by each of the two labelings (actual labels and cluster assignments). Visual statistics are often the best way to develop an intuition about data.

    If you want to explore more, a convenient Python package for clustering comparisons is CluSim.
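To see how differently the two scores treat a clustering that carries no information, here is a minimal sketch that computes both from a contingency table. The helper names (comb2, entropy_of, contingency, rand_index_of, nmi_of) are hypothetical and this is not the File Exchange code. The NMI uses the geometric-mean normalisation I(A;B) / sqrt(H(A) H(B)), which is one of several conventions and is an assumption here. Labels must be positive integers.

```matlab
% Hypothetical helpers, not the File Exchange implementations.
comb2       = @(x) x .* (x - 1) / 2;                  % unordered pairs among x items
entropy_of  = @(p) -sum(p(p > 0) .* log(p(p > 0)));   % Shannon entropy in nats
contingency = @(a, b) accumarray([a(:), b(:)], 1);    % C(i,j) = #points with a == i and b == j

% Rand index: fraction of point pairs on which the two partitions agree.
rand_index_of = @(C) (comb2(sum(C(:))) + 2 * sum(comb2(C(:))) ...
    - sum(comb2(sum(C, 2))) - sum(comb2(sum(C, 1)))) / comb2(sum(C(:)));

% NMI (assumed normalisation): I(A;B) / sqrt(H(A) * H(B)),
% with I(A;B) = H(A) + H(B) - H(A,B).
mutual_info = @(P) entropy_of(sum(P, 2)) + entropy_of(sum(P, 1)) - entropy_of(P(:));
nmi_of = @(C) mutual_info(C / sum(C(:))) / ...
    sqrt(entropy_of(sum(C, 2) / sum(C(:))) * entropy_of(sum(C, 1) / sum(C(:))));

actual_label = kron((1:3)', ones(50, 1));   % three classes of 50 points, as in Iris
random_label = randi(3, 150, 1);            % labels drawn uniformly at random

C_same = contingency(actual_label, actual_label);
C_rand = contingency(actual_label, random_label);

fprintf('identical: Rand = %.3f, NMI = %.3f\n', rand_index_of(C_same), nmi_of(C_same));
fprintf('random:    Rand = %.3f, NMI = %.3f\n', rand_index_of(C_rand), nmi_of(C_rand));
```

With three balanced classes, random labels typically give a Rand index somewhere around 0.55 while the NMI stays close to 0, so the two scores start from very different baselines.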
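The visual check suggested above could look roughly like this in MATLAB. This is a sketch under stated assumptions: it loads the fisheriris sample data that ships with the Statistics and Machine Learning Toolbox instead of the asker's iris matrix, and it uses PCA as the dimensionality reduction, which is only one possible choice.

```matlab
% Sketch only: requires the Statistics and Machine Learning Toolbox.
load fisheriris                       % provides meas (150x4) and species (150x1 cell)
actual_label = grp2idx(species);      % species names -> integer labels 1..3
result_label = kmeans(meas, 3, 'MaxIter', 10000);

[~, score] = pca(meas);               % coordinates of each point on the principal components

figure;
subplot(1, 2, 1);
gscatter(score(:, 1), score(:, 2), actual_label);
title('Actual labels');
xlabel('PC 1'); ylabel('PC 2');

subplot(1, 2, 2);
gscatter(score(:, 1), score(:, 2), result_label);
title('k-means assignments');
xlabel('PC 1'); ylabel('PC 2');
```

Cluster numbers from kmeans are arbitrary, so the colors in the two panels need not match. Compare the shapes of the groups rather than the colors.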