
How to extract row samples from a PCA analysis


I am using the ggbiplot package to run a PCA on my data. The data are organized with the sample names as row names and 4 columns of data.

But there are many rows, more than 1000.

When I run ggbiplot I get the plot shown below, which separates my data nicely: [Figure: PCA analysis by means of ggbiplot]

As you can see, the sample names overlap, so they cannot be read easily. I would like to extract the row names of the samples in each of these 9 groups to get an idea of what is separating the data. One idea is to extract the samples that fall within a given range of the X and Y axes.

Is there any way to do this? ggbiplot works with an object of class "prcomp".


Solution

  • PCA helps visualize data along the principal axes, i.e. the directions of maximum variance. Therefore, detecting clusters becomes easier (as in your biplot).

    But to assign a data point (row) to a particular cluster, you need to run a clustering algorithm. As your data seem to have non-overlapping clusters, any clustering algorithm should do. Since you already know how many clusters you need and have a rough idea of the cluster centers along the principal axes, I would recommend running the K-means algorithm (k = 9 for your analysis). It will give you an integer vector specifying which of the 9 clusters each data point belongs to.

    It should also work easily if you run K-means directly on the PCA scores, as you already have an initial guess about the centroids from the above biplot.
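
As a rough illustration of the K-means suggestion, here is a minimal sketch in base R. It assumes a "prcomp" object like the one you pass to ggbiplot. The data frame `dat`, the sample names, and the box limits are made up for the example and are not taken from your data. It clusters the PC1/PC2 scores with `kmeans()` and lists the row names per cluster. It also shows the range-based selection mentioned in the question.

```r
# Synthetic stand-in for the real data (all values are made up):
# 9 well-separated groups, 4 columns, sample names as row names.
set.seed(1)
grid  <- expand.grid(a = c(0, 10, 20), b = c(0, 10, 20))  # 9 group centres
group <- rep(seq_len(nrow(grid)), each = 20)
n     <- length(group)
dat <- data.frame(
  v1 = grid$a[group] + rnorm(n),
  v2 = grid$b[group] + rnorm(n),
  v3 = grid$a[group] + rnorm(n),
  v4 = grid$b[group] + rnorm(n)
)
rownames(dat) <- paste0("sample", seq_len(n))

# PCA; pca$x holds the scores, one row per sample, with row names preserved.
pca    <- prcomp(dat, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# K-means on the first two PCs with k = 9. `centers` could instead be a
# 9 x 2 matrix of centroids guessed from the biplot.
km <- kmeans(scores, centers = 9, nstart = 25)

# Named list: one character vector of sample names per cluster.
samples_by_cluster <- split(rownames(scores), km$cluster)

# Alternative from the question: select samples inside a rectangle on the
# PC1/PC2 plane. The limits below are hypothetical.
in_box <- scores[, "PC1"] > -0.5 & scores[, "PC1"] < 0.5 &
          scores[, "PC2"] > -0.5 & scores[, "PC2"] < 0.5
samples_in_box <- rownames(scores)[in_box]
```

`samples_by_cluster[["1"]]` then gives the sample names assigned to cluster 1; the cluster numbers are arbitrary and can change between runs. If you go the range route, note that ggbiplot may rescale the plotted scores depending on its `obs.scale` argument, so limits read off the figure may not match the scale of `pca$x` directly.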