matlab matrix cluster-analysis fuzzy-logic

MATLAB - How to compare and assign a value to a cluster from a dataset?

Hello StackOverflow community,

I'm having a hard time wrapping my head around a problem in MATLAB.

I have a matrix that looks like this:

[image of the cluster table, not reproduced]

This is a clustered table from a very large dataset.

I have a secondary table, also very large, of size 5000x4. It contains only integers. How do I compare the values in columns 1 through 3 of each row of this secondary table with the rows of the first table, and have the code decide which cluster the row belongs to, based on which combination of values it is closest to?

For example, the secondary table has a row with values 141, 162, 239, 1. By looking at it, I can tell that it belongs in row 1 of the cluster table. But I can't go through thousands of rows checking them manually.

Column 4 can be disregarded for now as it will be used for other purposes. If I am somehow unclear in the question, please let me know, I have a hard time explaining in English. Any advice will be appreciated.

Solution

You could cluster in terms of minimal L2 distance:

% d(j,i) = Euclidean distance between row i of A and row j of B
d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';
[~,ic] = min(d,[],1)

Here A is the data to classify (your secondary table) and B is the cluster table. The variable ic contains the cluster number (the index of the closest row of B) for each row of A.
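
If the one-liner is hard to parse, the following minimal sketch writes the same computation as explicit loops and compares it with the vectorized form, which relies on the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a*b'. The small A and B matrices are made up for illustration only.

```matlab
% Illustrative data (made up): 2 cluster rows in B, 3 points in A
B = [10 10 10; 200 200 200];
A = [12 9 11; 198 205 199; 0 0 0];

% Naive version: distance from every row of A to every row of B
dLoop = zeros(size(B,1), size(A,1));
for i = 1:size(A,1)
    for j = 1:size(B,1)
        dLoop(j,i) = sqrt(sum((A(i,:) - B(j,:)).^2));
    end
end

% Vectorized version from the answer
d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';

% Index of the closest row of B for each row of A, computed both ways
[~,icLoop] = min(dLoop,[],1);
[~,ic]     = min(d,[],1);
```

Both versions should yield the same assignment; the vectorized one avoids the double loop, which matters for tables with thousands of rows.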

(Trim off column 4 and then compute the above.)
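
Applied to the question, this might look like the sketch below. The variable names clusters, data and assigned are hypothetical, and so are all values except the row 141, 162, 239, 1, because the original cluster table is not reproduced here.

```matlab
% Hypothetical cluster table (one row per cluster, 3 columns); made-up values
clusters = [140 160 240;
             60  30 200;
            230  70  30];

% Hypothetical secondary table (N-by-4); only the first row is from the question
data = [141 162 239 1;
         58  35 210 2;
        228  66  25 1];

% Drop column 4; double avoids overflow if the tables are stored as an integer type
A = double(data(:,1:3));
B = double(clusters);

d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';
[~,ic] = min(d,[],1);      % ic(i) = row of clusters closest to data(i,:)

assigned = [data, ic(:)];  % append the cluster index as a fifth column
```

Keeping column 4 in data and appending ic as an extra column leaves the fourth column available for later use.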

Example with 4 columns:

>> B = randi(255,3,4)

B =

   255   164   195   120
    59    27   206    56
   235    69    27   236

>> A = B(randi(3,10,1),:) + randi(20,10,4) - 10

A =

   259   163   195   116
   226    61    25   228
   255   160   195   121
    69    29   210    62
   248   167   205   116
   260   173   187   115
    62    37   212    53
   237    61    29   236
   255   168   204   125
   237    72    20   237

>> d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';
>> [~,ic] = min(d,[],1)
ic =

     1     3     1     2     1     1     2     3     1     3

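You can also use pdist2 with any other distance metric you like, or use bsxfun with the more familiar formulation. Note that d comes out transposed here, with one row per row of A, so use min(d,[],2) instead of min(d,[],1):

d = squeeze(sqrt(sum(bsxfun(@minus,A,permute(B,[3 2 1])).^2,2)));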
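
A minimal sketch of the pdist2 variant, assuming the Statistics and Machine Learning Toolbox is available. pdist2(A,B) returns a size(A,1)-by-size(B,1) matrix, so the minimum is taken along dimension 2. The data reuses B and three rows of A from the example above; 'cityblock' is only one possible choice of alternative metric.

```matlab
B = [255 164 195 120; 59 27 206 56; 235 69 27 236];    % cluster rows
A = [259 163 195 116; 226 61 25 228; 69 29 210 62];    % rows to assign

% Euclidean distance (the default), rows of A against rows of B
[~,icEuclid] = min(pdist2(A,B),[],2);

% Same idea with a different metric
[~,icCity] = min(pdist2(A,B,'cityblock'),[],2);
```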

Or kmeans...
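
One way to read the kmeans suggestion, sketched under the assumption that the Statistics and Machine Learning Toolbox is installed: seed kmeans with the known cluster rows through its 'Start' option. Unlike the nearest-row assignment above, kmeans moves the centers as it iterates, so the result can differ when the data do not sit tightly around the original rows. The data reuses B and some rows of A from the example.

```matlab
B = [255 164 195 120; 59 27 206 56; 235 69 27 236];   % known cluster rows
A = [259 163 195 116; 226 61 25 228; 69 29 210 62;
     248 167 205 116; 62 37 212 53; 237 61 29 236];   % data to assign

% Start from B instead of random seeds; C holds the re-estimated centers
[idx, C] = kmeans(A, size(B,1), 'Start', B);
```

If the cluster table must stay fixed, the distance-based assignment is the safer choice.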
