r, plot, subset, pca

Plotting select PCA loadings in R


I have just run a PCA on a large data set with approximately 20,000 variables, using the following code:

df_pca <- prcomp(df, center=FALSE, scale.=TRUE)

I am curious how my variables influenced PCA.1 (Dimension 1 of the PCA analysis) and PCA.2 (Dimension 2 of the PCA analysis).

I used the following code to look at how each variable influenced the dimensional analysis:

fviz_pca_var(df_pca, col.var = "black")

However, this creates a graph with all 20,000 of my variables and since there is so much information, it is unreadable.

Is there a way to select the variables that have most influenced PCA.1 and PCA.2 and graph only those?

Thank you in advance!


Solution

  • What you want to do is first get the loadings table, which relates each principal component (the synthetic variables) to the original variables. For a prcomp object it is stored in rotation:

    a <- df_pca$rotation
    

    Then we can use dplyr to manipulate the data frame and extract what we want:

    library(dplyr)
    library(tibble)
    # Rank variables by their combined squared loadings on PC1 and PC2
    a %>% as.data.frame() %>% rownames_to_column() %>%
      select(rowname, PC1, PC2) %>% arrange(desc(PC1^2 + PC2^2)) %>% head(10)
    

    The above will show the 10 variables with the largest loadings on PC1 and PC2 combined. You can rank by PC1 only by changing to arrange(desc(abs(PC1))), or by PC2 only with arrange(desc(abs(PC2))), and see more or fewer than 10 variables by changing head(10).
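
To get from that ranking to the plot the question asks for, one option is to pass the selected variable names to fviz_pca_var through its select.var argument. The following is a minimal sketch, not a definitive recipe: mtcars stands in for the asker's df, the cutoff of 5 variables is arbitrary, and the names loadings, score and top_vars are made up for illustration. The ranking uses base R only, so it does not depend on dplyr.

```r
# Stand-in data: mtcars replaces the asker's `df` (assumption for illustration).
df <- mtcars
df_pca <- prcomp(df, center = FALSE, scale. = TRUE)  # same settings as in the question

# Score each variable by its squared loadings on PC1 and PC2 combined.
loadings <- df_pca$rotation[, c("PC1", "PC2")]
score <- rowSums(loadings^2)

# Keep the names of the 5 highest-scoring variables (5 is an arbitrary choice).
top_vars <- head(names(sort(score, decreasing = TRUE)), 5)
print(top_vars)

# Draw only those variables; this step is skipped if factoextra is not installed.
if (requireNamespace("factoextra", quietly = TRUE)) {
  p <- factoextra::fviz_pca_var(df_pca, col.var = "black",
                                select.var = list(name = top_vars))
  print(p)
}
```

With 20,000 variables, only the number passed to head() needs to change. select.var also accepts a contrib entry (for example list(contrib = 10)) that lets factoextra pick the top contributors itself, but its ranking is based on contributions rather than on the raw squared loadings used here, so the two selections may differ.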