matlab, hierarchical-clustering, linkage

Hierarchical Clustering in MATLAB


I have clustered my data X with hierarchical clustering in the following way:

X = [1 1 1;
     2 2 2;
     1 1 0;
     1 2 2];
Y = pdist(X);
T = linkage(Y, 'complete');
c = cluster(T,'maxclust',2);

So X(1,:) and X(3,:) belong to cluster #1, and the others belong to cluster #2.

How can I determine which cluster a new data point (not in X) should be assigned to? For example, which cluster does [1 0 1] belong to?


Solution

  • A simple solution is to assign the new point to the cluster whose centroid is nearest.

    Nearest Centroid

    x_new = [1 0 1];
    
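    % Find cluster centroids (one row per cluster)
    X_c = zeros(numel(unique(c)), size(X,2));
    for cid = unique(c)'
       X_c(cid,:) = mean(X(c == cid,:),1);   % dim 1 so a single-point cluster stays a row
    end
    
    % Find closest centroid; c_new is the cluster index assigned to x_new
    [~,c_new] = min(pdist2(x_new,X_c));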
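
    To check the nearest-centroid rule by hand on the data from the question, here is a minimal sketch that uses only base MATLAB. It assumes the labels stated in the question (rows 1 and 3 in cluster 1, rows 2 and 4 in cluster 2) and hard-codes them instead of recomputing them with linkage and cluster; the names ids, d and k_min are mine.

    ```matlab
    X = [1 1 1; 2 2 2; 1 1 0; 1 2 2];
    c = [1; 2; 1; 2];          % labels as stated in the question (assumed, not recomputed)
    x_new = [1 0 1];

    % One centroid per cluster label
    ids = unique(c)';
    X_c = zeros(numel(ids), size(X,2));
    for k = 1:numel(ids)
        X_c(k,:) = mean(X(c == ids(k),:), 1);
    end

    % Euclidean distance from x_new to each centroid, without pdist2
    d = sqrt(sum((X_c - repmat(x_new, size(X_c,1), 1)).^2, 2));
    [~, k_min] = min(d);
    c_new = ids(k_min);        % label of the nearest centroid
    ```

    The centroids are [1 1 0.5] and [1.5 2 2], the distances are about 1.118 and 2.291, so [1 0 1] is assigned to cluster 1, the same cluster as X(1,:) and X(3,:).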
    

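    If you have more samples per cluster and want to account for how spread out each cluster is, you can compare z-scores of the Euclidean distances instead: standardize the new point's distance to each centroid by the mean and standard deviation of that cluster's member-to-centroid distances. This only works when those standard deviations are nonzero. In the four-point example from the question every member lies exactly 0.5 from its centroid, so both standard deviations are 0 and the z-scores are not finite.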

    Z-score of distances

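    x_new = [1 0 1];
    X_means = zeros(1,numel(unique(c)));   % mean member-to-centroid distance per cluster
    X_stds = zeros(1,numel(unique(c)));    % std of member-to-centroid distances per cluster
    X_c = zeros(numel(unique(c)), size(X,2));
    for cid = unique(c)'
       X_c(cid,:) = mean(X(c == cid,:),1);   % dim 1 so a single-point cluster stays a row
    
       distances = pdist2(X(c == cid,:),X_c(cid,:));
       X_means(cid) = mean(distances);
       X_stds(cid) = std(distances);
    end
    % Smallest z-score = closest relative to the cluster's own spread
    [~,c_new] = min((pdist2(x_new,X_c) - X_means)./X_stds);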
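
    The z-score rule divides by the standard deviation of the member distances, which can be zero for very small clusters. Here is a minimal sketch of one way to guard against that, using only base MATLAB and the labels stated in the question. Falling back to the plain nearest-centroid rule is my own assumption, not the only reasonable choice, and the tolerance tol is an arbitrary illustrative value.

    ```matlab
    X = [1 1 1; 2 2 2; 1 1 0; 1 2 2];
    c = [1; 2; 1; 2];          % labels as stated in the question (assumed, not recomputed)
    x_new = [1 0 1];
    tol = 1e-12;               % illustrative threshold for "zero" spread (assumption)

    ids = unique(c)';
    K = numel(ids);
    d_mean = zeros(1,K);       % mean member-to-centroid distance per cluster
    d_std = zeros(1,K);        % std of member-to-centroid distances per cluster
    d_new = zeros(1,K);        % distance from x_new to each centroid
    for k = 1:K
        members = X(c == ids(k),:);
        centroid = mean(members, 1);
        d = sqrt(sum((members - repmat(centroid, size(members,1), 1)).^2, 2));
        d_mean(k) = mean(d);
        d_std(k) = std(d);
        d_new(k) = sqrt(sum((x_new - centroid).^2));
    end

    if any(d_std < tol)
        [~, k_min] = min(d_new);                      % fallback: plain nearest centroid
    else
        [~, k_min] = min((d_new - d_mean) ./ d_std);  % z-score rule
    end
    c_new = ids(k_min);
    ```

    On the question's data both standard deviations are zero, so the fallback branch runs and [1 0 1] again goes to cluster 1.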
    

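    If you want to account for the variance of each component (dimension) separately, you can take the z-score of the per-component distances and average it over the components. (I'm not sure this gives a different result from the above.) As with the distance z-score, any component with zero standard deviation within a cluster makes the result non-finite.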

    Mean Z-score of component distances

    x_new = [1 0 1];
    X_means = zeros(numel(unique(c)),size(X,2));   % mean abs deviation per cluster and component
    X_stds = zeros(numel(unique(c)),size(X,2));    % std of abs deviations per cluster and component
    X_c = zeros(numel(unique(c)), size(X,2));
    for cid = unique(c)'
       X_c(cid,:) = mean(X(c == cid,:),1);   % dim 1 so a single-point cluster stays a row
    
       comp_distances = abs(X(c == cid,:) - repmat(X_c(cid,:),[nnz(c == cid),1]));
       X_means(cid,:) = mean(comp_distances,1);
       X_stds(cid,:) = std(comp_distances,0,1);
    end
    % abs() so the new point's deviations are measured the same way as comp_distances
    [~,c_new] = min(mean((abs(repmat(x_new,[size(X_c,1),1])-X_c) - X_means)./X_stds,2));