I am working on a clustering problem. I have a set of 100 observations, each described by 3 features. I have to cluster these observations into 2 groups (I have a label for each observation).
Before clustering, I first computed the pairwise distances between the observations with pdist, and then used MATLAB's mdscale function to map them back to 3 dimensions.
Using the transformed_observations as input to a k-means clustering algorithm gives better clustering results (i.e. the clusters match the labels) than using the original observations.
Can anyone explain why? I just tried it and it worked. Here are my steps:
% select the dimensionality of the features
dimensions = 3;
% generate an example data set
observations = rand(100, dimensions);
% if 'yes', use the combination of pdist + mdscale
use_dissimilarity = 'yes';
if strcmp(use_dissimilarity, 'yes')
    % compute pairwise distances between observations
    dissimilarity = pdist(observations, @kullback_leibler_divergence);
    % re-embed the observations in 3 dimensions
    transformed_observations = mdscale(dissimilarity, dimensions);
else
    transformed_observations = observations;
end
% cluster the observations
numbercluster = 2;
[IDX, clustercentroids] = kmeans(transformed_observations, numbercluster, ...
    'emptyaction', 'singleton', ...
    'replicates', 11, 'display', 'off');
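For reference, kullback_leibler_divergence is a custom distance function following pdist's distfun interface (a 1-by-n observation ZI against an m-by-n block ZJ, returning an m-by-1 vector). The exact implementation isn't shown here; a simplified sketch, which normalises each row to a probability distribution and symmetrises the divergence, could look roughly like this:

function D2 = kullback_leibler_divergence(ZI, ZJ)
% Symmetrized KL divergence usable as a pdist custom distance.
% ZI: 1-by-n observation, ZJ: m-by-n observations, D2: m-by-1 distances.
% Sketch only: rows are smoothed with eps and normalised so they sum to 1.
P = ZI + eps;
P = P ./ sum(P);
Q = ZJ + eps;
Q = bsxfun(@rdivide, Q, sum(Q, 2));
% KL(P||Q) + KL(Q||P), computed row by row against P
PQ = sum(bsxfun(@times, P, log(bsxfun(@rdivide, P, Q))), 2);
QP = sum(Q .* log(bsxfun(@rdivide, Q, P)), 2);
D2 = PQ + QP;
end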
pdist computes the pairwise distances (using KL divergence). mdscale (multidimensional scaling) then tries to embed those distances in a Euclidean vector space such that they are preserved as well as possible.
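As a small, optional check (not in the question's code), you can see how well the embedding preserves the original dissimilarities by comparing them with the Euclidean distances between the embedded points:

% compare Euclidean distances in the embedded space with the original
% KL-based dissimilarities (variable names taken from the question's code)
embedded_dist = pdist(transformed_observations);
relative_error = norm(embedded_dist - dissimilarity) / norm(dissimilarity);
fprintf('Relative embedding error: %.3f\n', relative_error);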
K-means only works with squared Euclidean distances (and a few other Bregman divergences).
So in my opinion it is a mistake that MATLAB lets you choose a few other distances:
'sqeuclidean' (default) | 'cityblock' | 'cosine' | 'correlation' | 'hamming'
It is not surprising that this worked better if KL divergence is appropriate for your data set, because this construction lets you run k-means on (an approximation of) the KL divergence.
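To make "better clustering results" concrete, you can quantify the agreement between the k-means assignments and your labels for both runs. A sketch, assuming the labels are stored as a 100-by-1 vector of 1s and 2s (not shown in the question):

% agreement between cluster assignments and known labels;
% with k = 2 the cluster numbering is arbitrary, so take the better
% of the two possible label mappings
agreement = mean(IDX == labels);
agreement = max(agreement, 1 - agreement);
fprintf('Cluster/label agreement: %.2f\n', agreement);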