Search code examples
matlabmatrixcluster-analysisk-means

k-means clustering using function 'kmeans' in MATLAB


I have this matrix:

x = [2+2*i 2-2*i -2+2*i -2-2*i];

I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:

A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);

At the receiver, I received the below vector, where the components are ordered based on what I sent originally, i.e., the components of x).

X = [A A A C C A C C];

Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:

  1. I am not sure if I can represent the complex numbers as shown in X above.
  2. I can't plot the result of the kmeans to show the clusters.
  3. I could not understand the clusters centroid results.
  4. How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?

Solution

  • If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.

    How k-means works is that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.


    Now, let's answer your questions one-by-one.

    1. Complex number representation

    k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.

    See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306

    2. Plot the results

    You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point, and each column is a variable. Therefore, you need to concatenate your A and C row-wise. Therefore:

    X = [A; A; A; C; C; A; C; C];
    

    Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.

    Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:

    [idx, centroids] = kmeans(X, 4);
    

    idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row tells you the mean of each cluster found in the data. If you want to plot the data, as well as the clusters, you simply need to do following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to colour in where the mean of each cluster is located:

    x = X(:,1);
    y = X(:,2);
    figure;
    hold on;
    colors = 'rgbk';
    for num = 1 : 4
        plot(x(idx == num), y(idx == num), [colors(num) '.']);
    end
    
    plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
    grid;
    

    The above code goes through each cluster, plots them in a different colour, then plots the centroids in cyan with a slightly larger thickness so you can see what the graph looks like.

    This is what I get:

    enter image description here

    3. Understanding centroid results

    This is probably because you didn't construct X properly. This is what I get for my centroids:

    centroids =
    
       -1.9176   -2.0759
        1.5980    2.8071
        2.7486    1.6147
        0.8202    0.8025
    

    This is pretty self-explanatory and I talked about how this is structured earlier.

    4. Best representation of the signal

    What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:

    [idx, centroids] = kmeans(X, 4, 'Replicates', 5);
    

    This will run kmeans 5 times and give you the best centroids of these 5 times.

    Now, if you want to determine what the best sequence that was transmitted, you'd have to split up your X into 150 rows each (as your random sequence was 150 elements), then run a separate kmeans on each subset. You can try to find the best representation of each part of the sequence by using the Replicates flag each time.... so you can do something like:

    for num = 1 : 8
        %// Look at 150 points at a time
        [idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
    
        %// Do your analysis
        %//...
        %//...
    end
    

    idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.


    If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. All you do is given a set of points that are defined within the domain of your problem, you just have to determine which cluster each point belongs to. Given that our data spans between -5 <= (x,y) <= 5, let's go through each point in the grid and determine which cluster each point belongs to. We'd then colour the appropriate point according to which cluster it belongs to.

    Something like:

    colors = 'rgbk';
    [X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
    X = X(:);
    Y = Y(:);
    figure;
    hold on;
    for idx = 1 : numel(X)
        [~,ind] = min(sum(bsxfun(@minus, [X(idx) Y(idx)], centroids).^2, 2));
        plot(X(idx), Y(idx), [colors(ind), '.']);
    end
    plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
    

    The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.

    Here's what the decision regions look like:

    enter image description here


    Hope this helps! Good luck!