Search code examples
rggplot2pcadimensionality-reduction

R: PCA ggplot Error "arguments imply differing number of rows"


I have a dataset: https://docs.google.com/spreadsheets/d/1ZgyRQ2uTw-MjjkJgWCIiZ1vpnxKmF3o15a5awndttgo/edit?usp=sharing

that I'm trying to apply PCA analysis and to achieve a graph based on graph provided in this post:

https://stats.stackexchange.com/questions/61215/how-to-interpret-this-pca-biplot-coming-from-a-survey-of-what-areas-people-are-i

However, an error doesn't seem to go away:

 Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = 
 TRUE,  : 
 arguments imply differing number of rows: 0, 1006

Following is my code that I have trouble finding the source of error. Would like to have some help for error detection. Any hints? The goal is to produced a PCA graph grouped by levels of Happiness.in.life. I modified the original code to fit with my dataset. Originally, group is determined by Genders, which has 2 levels. What I'm attempting to do is to build a graph based on 5 levels of Happiness.in.life. However, it doesn't seem I can use the old code...

Thanks!

library(magrittr)
library(dplyr)
library(tidyr)
df <- happiness_reduced %>% dplyr::select(Happiness.in.life:Internet.usage, Happiness.in.life)  
head(df)
vars_on_hap <- df %>% dplyr::select(-Happiness.in.life)
head(vars_on_hap) 
group<-df$Happiness.in.life

fit <- prcomp(vars_on_hap)
pcData <- data.frame(fit$x)
vPCs <- fit$rotation[, c("PC1", "PC2")] %>% as.data.frame()

multiple <- min( 
(max(pcData[,"PC1"]) - min(pcData[,"PC1"]))/(max(vPCs[,"PC1"])-
min(vPCs[,"PC1"])), 
(max(pcData[,"PC2"]) - min(pcData[,"PC2"]))/(max(vPCs[,"PC2"])-
 min(vPCs[,"PC2"])) 
)

ggplot(pcData, aes(x=PC1, y=PC2)) + 
geom_point(aes(colour=groups))   + 
coord_equal() + 
geom_text(data=vPCs, 
        aes(x = fit$rotation[, "PC1"]*multiple*0.82, 
            y = fit$rotation[,"PC2"]*multiple*0.82, 
            label=rownames(fit$rotation)), 
        size = 2, vjust=1, color="black") +
geom_segment(data=vPCs, 
           aes(x = 0, 
               y = 0,
               xend = fit$rotation[,"PC1"]*multiple*0.8, 
               yend = fit$rotation[,"PC2"]*multiple*0.8), 
           arrow = arrow(length = unit(.2, 'cm')), 
           color = "grey30")

Solution

  • Here is an approach on how to plot the result of PCA in ggplot2:

    library(tidyverse)
    library(ggrepel)
    

    A good idea (not in all cases for instance if they are all in the same units) is to scale the variables prior to PCA

    hapiness %>% #this is the data from google drive. In the future try not top post such links on SO because they tend to be unusable after some time has passed
      select(-Happiness.in.life) %>%
      prcomp(center = TRUE, scale. = TRUE) -> fit
    

    Now we can proceed to plotting the fit:

    fit$x %>%  #coordinates of the points are in x element
      as.data.frame()%>% #convert matrix to data frame
      select(PC1, PC2) %>%  #select the first two PC
      bind_cols(hapiness = as.factor(hapiness$Happiness.in.life)) %>% #add the coloring variable
      ggplot() + 
      geom_point(aes(x = PC1, y = PC2, colour = hapiness)) + #plot points and color
      geom_segment(data = fit$rotation %>% #data we want plotted by geom_segment is in rotation element
               as.data.frame()%>%
               select(PC1, PC2) %>%
               rownames_to_column(), #get to row names so you can label after
               aes(x = 0, y = 0, xend = PC1 * 7,  yend = PC2* 7,  group = rowname), #I scaled the rotation by 7 so it fits in the plot nicely
                   arrow = arrow(angle = 20, type = "closed", ends = "last",length = unit(0.2,"cm")), 
                   color = "grey30") +
      geom_text_repel(data = fit$rotation %>%
                        as.data.frame()%>%
                        select(PC1, PC2) %>%
                        rownames_to_column(),
                      aes(x = PC1*7,
                          y = PC2*7,
                          label = rowname)) +
      coord_equal(ratio = fit$sdev[2]^2 / fit$sdev[1]^2) + #I like setting the ratio to the ratio of eigen values 
      xlab(paste("PC1", round(fit$sdev[1]^2/ sum(fit$sdev^2) *100, 2), "%")) +
      ylab(paste("PC2", round(fit$sdev[2]^2/ sum(fit$sdev^2) *100, 2), "%")) +
      theme_bw()
    

    enter image description here

    Look at all them happy people on the left (well it is hard to notice because of the colors used, I suggest using the palette jco from ggpubr library) get_palette('jco', 5) ie scale_color_manual(values = get_palette('jco', 5))

    quite a similar plot can be achieved with library ggord:

    library(ggord)
    
    ggord(fit, grp_in = as.factor(hapiness$Happiness.in.life),
          size = 1, ellipse = F, ext = 1.2, vec_ext = 5)
    

    enter image description here

    the major difference is ggord uses equal scaling for axes. Also I scaled the rotation by 5 instead of 7 as in the first plot.

    As you can see I do not like many intermediate data frames.