Hello StackOverflow community,
I'm having a hard time wrapping my head around a problem I'm having in MATLAB.
I have a matrix that looks like this:
This is a clustered table from a very large dataset.
I have a secondary table which is also very large and is 5000x4. This second table contains only integers. How do I make the software compare the values from columns 1 through 3 in this secondary table with the values from the first table, and then make the code decide which cluster each row of the second table belongs in, based on which combination of values it's closest to?
For example, the secondary table has a row with values 141, 162, 239, 1. By looking at it, I can tell that it belongs in row 1 of the cluster table. But I can't go through thousands of rows checking it manually.
Column 4 can be disregarded for now as it will be used for other purposes. If I am somehow unclear in the question, please let me know, I have a hard time explaining in English. Any advice will be appreciated.
You could cluster in terms of minimal L2 distance:
d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').'
[~,ic] = min(d,[],1)
The variable ic contains the cluster number (the index of the closest row of B) for each row of A.
(Trim off column 4 and then compute the above.)
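As a rough sketch of how that could look for the setup in the question: the cluster table below is made up (the real one is not shown), as are the variable name data and every row except the first, which is the example row from the question. The squared distance is enough to find the closest row, so the sqrt is skipped, and max(...,0) is just a guard against tiny negative values from round-off.

```matlab
% Hypothetical cluster table (one row per cluster, 3 columns).
B = [140 160 240;
      60  30 200;
     235  70  25];

% Hypothetical secondary table; column 4 is kept in data but not used here.
data = [141 162 239 1;
         58  33 205 2;
        230  75  20 1;
        150 150 230 3];

% Trim off column 4. The cast to double matters if the table is stored
% in an integer class, where the squares below would saturate.
A = double(data(:, 1:3));

% Squared Euclidean distances, size(A,1)-by-size(B,1).
d2 = bsxfun(@plus, sum(A.^2, 2), sum(B.^2, 2).') - 2 * (A * B.');
d2 = max(d2, 0);            % guard against small negative round-off

% ic(i) is the row of B closest to A(i,:).
[~, ic] = min(d2, [], 2);
```

Here d2 is not transposed, so the minimum is taken along dimension 2 and ic comes out as a column vector with one entry per row of data.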
Example with 4 columns:
>> B = randi(255,3,4)
B =
255 164 195 120
59 27 206 56
235 69 27 236
>> A = B(randi(3,10,1),:) + randi(20,10,4) - 10
A =
259 163 195 116
226 61 25 228
255 160 195 121
69 29 210 62
248 167 205 116
260 173 187 115
62 37 212 53
237 61 29 236
255 168 204 125
237 72 20 237
>> d = sqrt(bsxfun(@plus, sum(A.*A,2), sum(B.*B,2)') - 2 * A*B.').';
>> [~,ic] = min(d,[],1)
ic =
1 3 1 2 1 1 2 3 1 3
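If you want to convince yourself that the vectorized expression is doing the right thing, one option is to compare it with a plain loop on a few rows. This is only a sanity-check sketch: it reuses B and rows 1, 2 and 4 of A from the example above, and ic_loop is a name made up for the check.

```matlab
% B and three rows of A copied from the example above.
B = [255 164 195 120;
      59  27 206  56;
     235  69  27 236];
A = [259 163 195 116;
     226  61  25 228;
      69  29 210  62];

ic_loop = zeros(1, size(A, 1));
for i = 1:size(A, 1)
    % Difference between every row of B and the i-th row of A.
    diffs = bsxfun(@minus, B, A(i, :));
    % Index of the row of B with the smallest squared distance.
    [~, ic_loop(i)] = min(sum(diffs.^2, 2));
end
```

For these three rows the loop gives 1 3 2, matching the first, second and fourth entries of ic above.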
You can also use pdist2 with any other distance metric you like, or use bsxfun with the more familiar formulation:
d = squeeze(sqrt(sum(bsxfun(@minus,A,permute(B,[3 2 1])).^2,2)));
% Note: this d is size(A,1)-by-size(B,1) (not transposed), so use [~,ic] = min(d,[],2)
Or kmeans...
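The pdist2 route mentioned above might look something like the following. Note that pdist2 is part of the Statistics and Machine Learning Toolbox, so this assumes you have it installed; the small A and B are made-up values for illustration only.

```matlab
% Made-up data: rows of A are points to assign, rows of B are clusters.
A = [141 162 239;
      58  33 205];
B = [140 160 240;
      60  30 200;
     235  70  25];

% Euclidean distance is the default; D is size(A,1)-by-size(B,1).
D = pdist2(A, B);
[~, ic] = min(D, [], 2);

% Same idea with a different metric, here city block (L1) distance.
D1 = pdist2(A, B, 'cityblock');
[~, ic1] = min(D1, [], 2);
```

Because pdist2 returns one row per row of A, the minimum is taken along dimension 2.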
Reference 1 and 2.