r dataframe tooltip pca

name the samples in a PCA plot


I have a PCA plot with a lot of data and I want to identify which samples are the outliers. When I use

geom.ind = c("text")

then there is so much text that I can't read anything.

Here is a minimal reproducible example. (I already used it in another question, "tooltip with names in a PCA plot", but the answer there only works manually, and my real data frame is very large.)

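library(FactoMineR)  # PCA()
library(factoextra)  # fviz_pca_ind()
library(plotly)      # ggplotly()

# data.frame() instead of the deprecated data_frame(): tibbles do not keep row names
dataframe <- data.frame("c1"=c(78,89,0),"c2"=c(89,89,34),"c3"=c(56,0,4))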
row.names(dataframe) <- c("name1","name2","name3")

sub <- PCA(dataframe)

pca <- fviz_pca_ind(sub, pointsize = "cos2", 
             pointshape = 21, fill = "#E7B800",
             repel = TRUE, # Avoid text overlapping (slow if many points)
             geom = c("text","point"), 
             xlab = "PC1", ylab = "PC2",label = row.names(dataframe)
             )

interactive <- ggplotly(pca,dynamicTicks = T,tooltip = c("x","y",label = list))

As you can see, I tried to do it with the ggplotly() function, but that does not work.

I want to identify the sample names (name1, name2, name3) in my plot. How can I do this for a large dataset?

Thank you so much in advance


Solution

  • You can use the following code

    library(tidyverse)
    library("factoextra")
    library(plotly)
    library(FactoMineR)
    
    # data.frame() instead of the deprecated data_frame(): tibbles do not keep row names
    dataframe <- data.frame("c1"=c(78,89,0),"c2"=c(89,89,34),"c3"=c(56,0,4))
    row.names(dataframe) <- c("name1","name2","name3")
    
    sub <- PCA(dataframe)
    
    pca <- fviz_pca_ind(sub, pointsize = "cos2", 
                        pointshape = 21, fill = "#E7B800",
                        repel = TRUE, # Avoid text overlapping (slow if many points)
                        geom = c("text","point"), 
                        xlab = "PC1", ylab = "PC2",label = c("ind")
    )
    
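    # Convert the ggplot object into an interactive plotly widget
    interactive <- ggplotly(pca,tooltip = c("x","y","colour"))
    
    # Overwrite the hover text of the first trace with one string per sample,
    # built from the columns that fviz_pca_ind() stores in pca$data
    bggly <- plotly_build(interactive)
    bggly$x$data[[1]]$text <- 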
      with(pca$data, paste0("name: ", name, 
                            "</br></br>x: ", x, 
                            "</br>y: ", y, 
                            "</br>coord: ", coord, 
                            "</br>cos2: ", cos2, 
                            "</br>contrib: ", contrib))
    bggly
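
If the goal is mainly to spot outliers, another option is to label only the samples that lie farthest from the origin of the PC1/PC2 plane, so the plot stays readable. The following is a minimal sketch in base R, not part of the original answer: it uses prcomp() instead of FactoMineR's PCA(), the data are simulated, and `n_label` is a hypothetical cutoff you would choose yourself.

```r
# Simulated data (an assumption for illustration): 50 samples, two planted outliers
set.seed(1)
mat <- matrix(rnorm(50 * 3), ncol = 3,
              dimnames = list(paste0("name", 1:50), c("c1", "c2", "c3")))
mat["name7", ]  <- c(8, 8, 8)
mat["name23", ] <- c(-9, 7, -8)

# PCA with base R; keep the scores on the first two components
pc <- prcomp(mat, scale. = TRUE)
scores <- pc$x[, 1:2]

# Euclidean distance of every sample from the origin of the PC1/PC2 plane
d <- sqrt(rowSums(scores^2))

# Hypothetical cutoff: label only the n_label most distant samples
n_label <- 2
outliers <- names(sort(d, decreasing = TRUE))[seq_len(n_label)]

# Empty strings for all other samples, so only the outliers get text
labels <- ifelse(rownames(scores) %in% outliers, rownames(scores), "")

# Draw the plot only when a graphics device is expected to be available
if (interactive()) {
  plot(scores, xlab = "PC1", ylab = "PC2")
  text(scores, labels = labels, pos = 3)
}
```

Distance from the origin is only one possible outlier criterion; a fixed number of labels is used here because it keeps the amount of text predictable regardless of dataset size.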
    

    The hover-text approach is adapted from a post by Stéphane Laurent. For a large dataset in .csv format with the first column holding the row names, you can read it in with df <- read.csv("Test_Data.csv", row.names = 1), provided the row names are not duplicated.
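
Because read.csv() with row.names = 1 stops with an error when the names are duplicated, it can help to check for duplicates first. This is a small sketch under stated assumptions: the file contents and the column name `sample` are made up, and the file is written to a temporary path only so the example runs on its own.

```r
# Hypothetical CSV, written to a temp file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,c1,c2,c3",
             "name1,78,89,56",
             "name2,89,89,0",
             "name3,0,34,4"), csv_path)

# Read the names as an ordinary column first so they can be inspected
raw <- read.csv(csv_path, stringsAsFactors = FALSE)

# Report any name that occurs more than once before using it as a row name
dup <- unique(raw[[1]][duplicated(raw[[1]])])
if (length(dup) > 0) {
  stop("Duplicated sample names: ", paste(dup, collapse = ", "))
}

# Promote the first column to row names and drop it from the data
df <- raw[-1]
rownames(df) <- raw[[1]]
```

The resulting `df` can then be passed to PCA() in place of `dataframe`.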