Tags: matlab, matrix, cluster-analysis, fuzzy-logic

MATLAB - How to compare and assign a value to a cluster from a dataset?


Hello StackOverflow community,

I'm having a hard time wrapping my head around a problem I'm having in MATLAB.

I have a matrix that looks like this:

[image of the cluster table, not available]

This is a clustered table from a very large dataset.

I have a secondary table, also very large (5000x4), that contains only integers. How do I compare the values in columns 1 through 3 of this secondary table with the values in the first table, and have the code decide which cluster each row of the second table belongs in, based on which combination of values it's closest to?

For example, the secondary table has a row with values 141, 162, 239, 1. By looking at it, I can tell that it belongs in row 1 of the cluster table. But I can't go through thousands of rows checking them manually.

Column 4 can be disregarded for now, as it will be used for other purposes. If the question is unclear in any way, please let me know; I have a hard time explaining in English. Any advice will be appreciated.


Solution

  • You could assign each row to the cluster at minimal L2 (Euclidean) distance:

    d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').'
    [~,ic] = min(d,[],1)
    

    Here A holds the rows to classify (the secondary table) and B is the cluster table. The variable ic contains the cluster number (the index of the closest row of B) for each row of A.
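
    The expression for d relies on the usual expansion of the squared Euclidean distance. Writing a_i for row i of A and b_j for row j of B (notation introduced here for illustration):

    ```latex
    \[
    \lVert a_i - b_j \rVert^2
      = \sum_k (a_{ik} - b_{jk})^2
      = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i \cdot b_j
    \]
    ```

    In the code, sum(A.*A,2) supplies the first term, sum(B.*B,2)' the second, and A*B.' the dot products for all pairs at once.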

    (Trim column 4 off the secondary table first, then compute the above on columns 1 through 3.)
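
    Applied to the tables in the question, the trimming step might look like the sketch below. The names data and centers are hypothetical, and the numbers are made up (the real cluster table is not reproduced here); only the row 141, 162, 239, 1 comes from the question. The max(d2, 0) guard is an addition, assuming round-off could make a squared distance slightly negative.

    ```matlab
    % Hypothetical stand-ins: 'centers' plays the role of the cluster table,
    % 'data' the role of the 5000x4 secondary table (values are made up).
    centers = [140 160 240; 60 30 200; 230 70 30];
    data    = [141 162 239 1; 58 33 205 2; 228 75 28 1];

    A = double(data(:,1:3));   % drop column 4; double avoids integer saturation
    B = double(centers);

    % Squared distances via |a|^2 + |b|^2 - 2*a*b'; clamp tiny negatives
    % caused by round-off before taking the square root.
    d2 = bsxfun(@plus, sum(A.*A,2), sum(B.*B,2).') - 2*(A*B.');
    d  = sqrt(max(d2, 0)).';   % size: rows of B by rows of A

    [~, ic] = min(d, [], 1);   % ic(k) = closest row of centers for data(k,:)
    labeled = [data, ic(:)];   % append the cluster index as a fifth column
    ```

    Keeping column 4 in data and appending ic as an extra column leaves the original table intact for whatever column 4 is used for later.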

    Example with 4 columns:

    >> B = randi(255,3,4)
    
    B =
    
       255   164   195   120
        59    27   206    56
       235    69    27   236
    
    >> A = B(randi(3,10,1),:) + randi(20,10,4) - 10
    
    A =
    
       259   163   195   116
       226    61    25   228
       255   160   195   121
        69    29   210    62
       248   167   205   116
       260   173   187   115
        62    37   212    53
       237    61    29   236
       255   168   204   125
       237    72    20   237
    
    >> d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';
    >> [~,ic] = min(d,[],1)
    ic =
    
         1     3     1     2     1     1     2     3     1     3
    

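    You can also use pdist2 with any other distance metric you like, or use bsxfun with the more familiar formulation. Note that this d has one row per row of A (the transpose of the d above), so take the minimum along dimension 2:

    d = squeeze(sqrt(sum(bsxfun(@minus,A,permute(B,[3 2 1])).^2,2)));
    [~,ic] = min(d,[],2)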
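
    If the Statistics and Machine Learning Toolbox is available, the pdist2 route might look like the following sketch. The values are the first three columns of a few rows from the example above; the choice of 'cityblock' as the second metric is only an illustration.

    ```matlab
    % pdist2 requires the Statistics and Machine Learning Toolbox.
    B = [255 164 195; 59 27 206; 235 69 27];   % cluster rows (example values)
    A = [259 163 195; 226 61 25; 69 29 210];   % rows to assign

    d = pdist2(A, B);                % Euclidean by default; size(A,1)-by-size(B,1)
    [~, ic] = min(d, [], 2);         % minimum along each row of d

    dc = pdist2(A, B, 'cityblock');  % same idea with the L1 (city block) metric
    [~, icCity] = min(dc, [], 2);
    ```

    Because pdist2 returns one row per row of A, the minimum is taken along dimension 2 here rather than dimension 1.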
    

    Or use kmeans, if you would rather re-cluster the data than assign rows to the existing clusters.
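
    A minimal sketch of the kmeans route, assuming the existing cluster rows are used as starting centroids through the 'Start' option. This is a different operation from the nearest-row lookup: kmeans recomputes the centroids from the data, so the result may not match the original cluster table. The values are again taken from the first three columns of the example above.

    ```matlab
    % kmeans requires the Statistics and Machine Learning Toolbox.
    B = [255 164 195; 59 27 206; 235 69 27];   % existing cluster rows (seeds)
    A = [259 163 195; 226 61 25; 255 160 195; ...
          69  29 210;  62 37 212; 237  61  29];  % rows to cluster

    % Seed kmeans with B; the centroids returned in C are recomputed
    % from A and will generally differ from B.
    [idx, C] = kmeans(A, size(B,1), 'Start', B);
    ```

    Here idx holds the cluster index for each row of A and C the updated centroids.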

    Reference 1 and 2.