Tags: algorithm, k-means, pca

Select relevant features with PCA and K-Means


I am trying to understand the PCA and K-Means algorithms in order to extract relevant features from a set of features.

I don't know which branch of computer science studies these topics; it seems there aren't good resources on the internet, just some papers that I don't understand well. An example of such a paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf

I have CSV files of people's walks, composed as follows:

  • TIME, X, Y, Z: these values are recorded by the accelerometer

What I did

  • I transformed the dataset into a table in Python
  • I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2,000+ features per walk (a minimal sketch of this step is shown after the list)
  • I have to use PFA, Principal Feature Analysis, to select the relevant features from the set of feature vectors

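Roughly, the extraction step looks like this (a minimal sketch; the walk_id column name and the exact tsfresh parameters are assumptions based on my CSV layout):

    import pandas as pd
    from tsfresh import extract_features

    # One long table: every row is one accelerometer sample of one walk.
    # Columns: walk_id (which walk the sample belongs to), TIME, X, Y, Z.
    df = pd.read_csv("walks.csv")

    # tsfresh turns each walk (all rows sharing a walk_id) into a single
    # row of 2,000+ extracted features.
    features = extract_features(df, column_id="walk_id", column_sort="TIME")
    print(features.shape)  # (number_of_walks, number_of_features)
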
To do the last point, I have to reduce the dimensionality of the set of walk feature vectors with PCA (PCA makes the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here is my first question:

  • What should the input of PCA look like? Should the rows be the walks and the columns the features, or vice versa, i.e. the rows are the features and the columns are the people's walks?

After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. What should the input to K-Means look like? And what is the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' data, so each cluster contains some 'points' grouped by some rule. What I did and think is:

  • If the input to PCA has the walks as rows and the features as columns, then for K-Means I should swap rows and columns, because that way each point is a feature (but this is not the original feature data, only the reduced one, so I am not sure). Then, for each cluster, I would use the Euclidean distance to find the point closest to the centroid and select that feature. But how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I tell which feature in the original set a point in the reduced data corresponds to?

I know that what I am saying may not be correct, but I am trying to understand this. Can someone help me? Am I on the right track? Thanks!


Solution

  • For the PCA, make sure you separate your understanding of the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A', where the dimension (the number of features, in your case) of A' is possibly smaller than that of A.

    So the first feature/element in space A' is a linear combination of the features of A.

    The row/column orientation depends on the implementation, but if you use scikit-learn's PCA, the columns are the features (and the rows are the samples, i.e. your walks).
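    A minimal sketch with scikit-learn, assuming features is the tsfresh output with one row per walk and one column per feature (the variable names, the scaling step and the choice of 10 components are illustrative, not prescriptive):

        from sklearn.decomposition import PCA
        from sklearn.preprocessing import StandardScaler

        # Rows = walks (samples), columns = features: the orientation
        # scikit-learn expects.
        X = StandardScaler().fit_transform(features)  # PCA is scale-sensitive

        pca = PCA(n_components=10)        # map A (2k+ features) to A' (10 features)
        X_reduced = pca.fit_transform(X)  # shape: (number_of_walks, 10)

        # Each new feature is a linear combination of the original ones;
        # the weights of those combinations are in pca.components_.
        print(pca.components_.shape)      # (10, number_of_original_features)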

    You can feed the PCA output, the A' space, to K-Means, and it will cluster the points in this (usually reduced-dimension) space.

    Each point will belong to a cluster, and the idea is that if you ran K-Means on A, you would probably end up with the same or similar clusters as with A'. Computationally, A' is much cheaper. You now have a clustering on both A' and A, since we agree that points that are similar in A' are also similar in A.
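    Continuing the sketch above, clustering the reduced space A' with K-Means could look like this (the choice of 5 clusters is arbitrary):

        from sklearn.cluster import KMeans

        kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
        labels = kmeans.fit_predict(X_reduced)  # one cluster label per walk

        # labels[i] is the cluster of walk i; walks that are similar in A'
        # (and, by the argument above, in A) end up with the same label.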

    The number of clusters is hard to answer; if you have no prior idea, look up the elbow method. But say you just want a sense of the different types of walks you have: I would argue for 3 to 8 clusters, not too many, then compare the 2-3 points closest to each centroid, and you have something consumable. The number of clusters does not have to match the number of features. For example, if we want to find the densest areas in some 2D region, you can easily use 50 clusters to get a sense of where 50 cities could be. Here the number of clusters is far higher than the space dimension, and it still makes sense.
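    A rough sketch of the elbow method mentioned above: run K-Means for a range of k and look for the value where the inertia (within-cluster sum of squares) stops dropping sharply.

        from sklearn.cluster import KMeans

        inertias = []
        for k in range(2, 11):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
            inertias.append((k, km.inertia_))

        # Plot k against inertia (e.g. with matplotlib) and pick the k at
        # the "elbow", where adding more clusters stops paying off.
        print(inertias)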