
Understanding the use of pdist in combination with mdscale


I am working on a clustering problem.

I have a set of 100 observations. Each observation is described by 3 features. I have to cluster these observations into 2 groups (I have a label for each observation).

Before clustering, I first computed pdist between the observations and then used the mdscale function in MATLAB to map the resulting dissimilarities back to 3 dimensions. Using these transformed_observations as the input to a k-means clustering algorithm gives better clustering results (i.e. the clusters match the labels) than using the original observations. Can anyone explain why? I just found this by trying...

Here are my steps...

% select the dimensions of my features
dimensions = 3;

% generate an example data set
observations = rand(100,dimensions);

% if yes use the combination of pdist + mdscale
use_dissimilarity = 'yes';

if strcmp(use_dissimilarity,'yes')
  % compute pairwise dissimilarities between observations
  dissimilarity = pdist(observations,@kullback_leibler_divergence);
  % embed the dissimilarities back into 3 Euclidean dimensions
  transformed_observations = mdscale(dissimilarity,dimensions);
else
  transformed_observations = observations;
end

% cluster the observations
numbercluster = 2;
[IDX, clustercentroids] = kmeans(transformed_observations, numbercluster,...
                    'emptyaction','singleton',...
                    'replicates',11,'display','off');
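
The custom distance handle @kullback_leibler_divergence is not defined in the snippet above. Below is a minimal sketch of what such a function could look like; the symmetrisation and the epsilon smoothing are my own assumptions, not necessarily what the original code does. pdist calls the handle with one observation XI (a 1-by-n row) and a block of observations XJ (an m-by-n matrix) and expects an m-by-1 vector of dissimilarities back. Note that KL-Divergence only makes sense if every row behaves like a discrete probability distribution (non-negative, summing to 1), which rand data does not guarantee.

function d = kullback_leibler_divergence(XI, XJ)
% XI: 1-by-n row vector (one observation)
% XJ: m-by-n matrix (a block of observations)
% d : m-by-1 vector of dissimilarities, as pdist expects
% assumption: rows are (approximately) discrete probability distributions;
% a small epsilon avoids log(0) and division by zero
epsilon = 1e-10;
P = XI + epsilon;
Q = XJ + epsilon;
% symmetrised KL divergence so the result behaves more like a distance
kl_pq = sum(bsxfun(@times, P, log(bsxfun(@rdivide, P, Q))), 2);
kl_qp = sum(bsxfun(@times, Q, log(bsxfun(@rdivide, Q, P))), 2);
d = kl_pq + kl_qp;
end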

Solution

  • pdist computes the pairwise distances (using KL-Divergence).

    mdscale (multidimensional scaling) then tries to embed those distances in a Euclidean vector space such that they are preserved as well as possible.

    K-means only works with squared Euclidean distances (and a few other Bregman divergences).

    So it is, in my opinion, an error that MATLAB allows a few other distances:

    'sqeuclidean' (default) | 'cityblock' | 'cosine' | 'correlation' | 'hamming'

    It is not surprising that this worked better if KL-Divergence is appropriate for your data set, because this construct allows using k-means on (an approximation of) KL-Divergence.
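
    One way to check how much of the KL structure survives the embedding (and therefore how meaningful the subsequent squared-Euclidean k-means is) is to look at the stress value that mdscale returns as its second output and to compare the embedded Euclidean distances with the original dissimilarities. A small diagnostic sketch, reusing the variable names from the question:

    % how faithfully does the 3-D embedding reproduce the KL dissimilarities?
    dissimilarity = pdist(observations, @kullback_leibler_divergence);
    [transformed_observations, stress] = mdscale(dissimilarity, dimensions);

    % stress close to 0 means the Euclidean distances in the embedding
    % match the original dissimilarities well
    fprintf('mdscale stress: %.4f\n', stress);

    % direct comparison: correlation between the KL dissimilarities and the
    % Euclidean distances that k-means effectively operates on
    embedded_distances = pdist(transformed_observations);
    fprintf('correlation: %.4f\n', corr(dissimilarity(:), embedded_distances(:)));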