r, ggplot2, pca

Is there a way to add a cumulative sum line to a fviz_eig plot?


I'm trying to make a nice plot of the principal components together with the cumulative variance explained. The data frame I'm working with is available at https://www.kaggle.com/miroslavsabo/young-people-survey?select=responses.csv

df.responses <- read.csv("Data/responses.csv")
pref <- colnames(df.responses[1:63]) # columns for Music, Movies and Hobbies preferences
# Replace missing values in each preference column with that column's median
for (i in seq_along(pref)) {
  df.responses[is.na(df.responses[, i]), i] <- median(df.responses[, i], na.rm = TRUE)
}
df.movies <- data.frame(df.responses[20:31])

Above, I load the data frame, replace the NAs in the columns I'm interested in with the column median, and select the subset I want to run PCA on.

library(ggplot2)
library(factoextra)

pca.movies <- prcomp(df.movies, scale. = TRUE)
pca.movies$rotation <- -pca.movies$rotation
pca.movies$x <- -pca.movies$x

fviz_pca_var(pca.movies,
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE   
)

pv.movies <- pca.movies$sdev^2 
pvp.movies <- pv.movies/sum(pv.movies)

pvp.movies

fviz_eig(pca.movies,
         addlabels = TRUE, 
         barcolor = "#E7B800", 
         barfill = "#E7B800", 
         linecolor = "#00AFBB", 
         choice = "variance", 
         ylim=c(0,25))

plot(cumsum(pvp.movies), xlab = "Principal component", ylab = "Cumulative proportion of variance explained", ylim = c(0, 1), type = "b")

With the above I managed to obtain two nice plots for the PCA. I would like to add the cumulative sum line (the one shown in the third, ugly plot) to the second plot. Is there a way to add such a line to the fviz_eig plot? I know this PCA is not really efficient; I'm just challenging myself with some dataviz.


Solution

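  • The object returned by fviz_eig is a ggplot object, so you can add the cumulative line to it as extra layers. The left axis is limited to 25%, so the cumulative percentages (0 to 100) are divided by 4 to fit on it, and a secondary axis multiplies by 4 to show the real values: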

    p <- fviz_eig(pca.movies,
             addlabels = TRUE, 
             barcolor = "#E7B800", 
             barfill = "#E7B800", 
             linecolor = "#00AFBB", 
             choice = "variance", 
             ylim=c(0,25))
    
    df <- data.frame(x=seq_along(pvp.movies),
                     y=cumsum(pvp.movies)*100/4)
    p <- p + 
         geom_point(data=df, aes(x, y), size=2, color="#00AFBB") +
         geom_line(data=df, aes(x, y), color="#00AFBB") +
         scale_y_continuous(sec.axis = sec_axis(~ . * 4, 
                                       name = "Cumulative percentage of variance explained") )
    print(p)
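
The factor 4 above is tied to the left axis limit of 25. As a rough sketch of how that factor could be derived instead of hardcoded, the base R snippet below builds the overlay data from a prcomp object. The helper name `cumvar_overlay` and its arguments are hypothetical (they are not part of factoextra), and the built-in `USArrests` data set is only a stand-in for `df.movies`. The `ncp` argument mirrors the default of 10 components that fviz_eig displays.

```r
# Hypothetical helper (not part of factoextra): builds the data for a
# cumulative-variance overlay plus the factor that maps it onto the left axis.
cumvar_overlay <- function(pca, left_axis_max, ncp = 10) {
  pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)  # % of variance per component
  cum_pct <- cumsum(pct)                     # running total, ends at 100
  n <- min(ncp, length(pct))                 # keep only the components that are drawn
  scale_factor <- 100 / left_axis_max        # e.g. 100 / 25 = 4
  list(
    data = data.frame(x = seq_len(n), y = cum_pct[seq_len(n)] / scale_factor),
    scale_factor = scale_factor
  )
}

# Stand-in data (assumption): USArrests replaces df.movies here.
pca.demo <- prcomp(USArrests, scale. = TRUE)
ov <- cumvar_overlay(pca.demo, left_axis_max = 25)
print(ov$scale_factor)
print(ov$data)
```

With this in place, `ov$data` could be passed to geom_point and geom_line in place of `df`, and `ov$scale_factor` could replace the literal 4 in the sec_axis formula, so changing ylim only requires changing `left_axis_max`.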
    
