Tags: matlab, machine-learning, data-analysis

How to understand the MATLAB built-in function "kmeans"?


Suppose I have a matrix A of size 2000-by-1000 double. I then apply the MATLAB built-in function "kmeans" to the matrix A.

k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');

I get C = 8-by-1000 double and idx = 2000-by-1 double, with values from 1 to 8. According to the documentation, C returns the k cluster centroid locations in a k-by-p (8-by-1000) matrix, and idx returns an n-by-1 vector containing the cluster index of each observation. My questions are:

1) I do not know how to interpret C, the centroid locations. Shouldn't locations be represented as (x, y) pairs? How do I read the matrix C correctly?

2) What are the final centers c1, c2, ..., ck? Are they just values, or are they locations?

3) For each cluster, if I only want the vector closest to the center of that cluster, how do I calculate and retrieve it?

Thanks!


Solution

  • Before I answer the three parts, I'll first explain the syntax used in MATLAB's documentation for k-means (http://www.mathworks.com/help/stats/kmeans.html).

    • A is your data matrix (it's represented as X in the documentation). It has n rows (in this case, 2000), which is the number of observations/data points you have, and p columns (in this case, 1000), which is the number of "features" each data point has. For example, if your data consisted of 2D points, then p would equal 2.
    • k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
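
    As a quick illustration of these shapes, here is a toy sketch (the sizes and variable names are mine, not the asker's 2000-by-1000 data):

      T = rand(6, 4);                          % n = 6 observations, p = 4 features
      [idx_t, C_t] = kmeans(T, 2, 'Distance', 'cosine');
      size(idx_t)                              % 6-by-1: one cluster index per row of T
      size(C_t)                                % 2-by-4: one p-dimensional centroid per cluster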

    Now I will answer the three parts:

    1. The C matrix has dimensions k x p. Each row represents one centroid. Centroid locations DO NOT have to be (x, y) at all; each centroid has as many dimensions as your data has features, namely p. In other words, if you have 2D points, you could graph the centroids as (x, y); if you have 3D points, you could graph them as (x, y, z). Since each data point in A has 1000 features, your centroids likewise have 1000 dimensions.
    2. This is somewhat difficult to explain without knowing exactly what your data is. Centroids are certainly not just single values, and they may not necessarily be locations. If your data A were coordinate points, you could certainly represent the centroids as locations. More generally, though, the centroid of a cluster is the mean of the observations assigned to it (with the cosine distance, the mean of those observations after normalizing them to unit length), so it acts as a representative point that is most similar to the members of its cluster, even though it need not coincide with any actual data point (see the sketch after this list).
    3. The kmeans function itself gives us a good way to accomplish this. It has four possible outputs, but I will focus on the fourth, which I will call D:

      [idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
      

      D has dimensions n x k. For data point i, row i of D gives the distance from that point to every centroid. Therefore, for each centroid (each column of D), you simply need to find the data point with the smallest distance and return it. The short code for this is in the edit below.
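
    To make part 2 concrete, here is a minimal sketch showing that a centroid is the mean of its cluster's members (toy data, my variable names, and the default squared-Euclidean distance; with 'cosine', MATLAB instead averages the rows after normalizing them to unit length):

      rng(1);                              % reproducible toy data
      B = rand(20, 3);                     % 20 observations, 3 features
      [idx_b, C_b] = kmeans(B, 2);         % default squared-Euclidean distance
      m = mean(B(idx_b == 1, :), 1);       % mean of the rows assigned to cluster 1
      disp(norm(m - C_b(1, :)))            % ~0: the centroid is exactly that mean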

    Also, just a tip: you should probably use the k-means++ method of initializing the centroids. It's faster and generally gives better results. You can request it like this:

    [idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
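
    (In recent MATLAB releases, 'plus' is already the default value of 'Start', so you may be getting k-means++ initialization even without asking for it; check the kmeans documentation for your version.)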
    

    Edit:

    Here is the code necessary for part 3:

    % D is n-by-k, so taking the min down each column (dimension 1) finds,
    % for every centroid, the row index of the observation closest to it.
    [~, min_idxs] = min(D, [], 1);
    closest_vecs = A(min_idxs, :);
    

    Each row i of closest_vecs is the vector that is closest to centroid i.
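
    If you did not request the D output from kmeans, you can get the same result by recomputing the point-to-centroid distances yourself. Here is a minimal sketch using pdist2, which supports the same 'cosine' distance (Dalt is my name; A and C are as above):

      Dalt = pdist2(A, C, 'cosine');       % n-by-k distances from each point to each centroid
      [~, min_idxs] = min(Dalt, [], 1);    % index of the closest observation per centroid
      closest_vecs = A(min_idxs, :);       % row i is the vector closest to centroid i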