Search code examples
rsubsetpca

Coloring subsets in PCA biplot


I am working on a gene expression data frame called expression. My samples belong to different subgroups, indicated in the colname (i.e. all samples that contain "adk" in their colname belong to the same subgroup)

       adk1  adk2  bas1  bas2  bas3  non1  ...
gene1   1.1   1.3   2.2   2.3   2.8   1.6
gene2   2.5   2.3   4.1   4.6   4.2   1.9
gene3   1.6   1.8   0.5   0.4   0.9   2.2
...

I already defined subsets using

adk <- expression[grepl('adk', names(expression))]

I then did a PCA on this data set using

pca = prcomp (t(expression), center = F, scale= F)

I now want to plot the PCs I got from the PCA against each other in a PCA biplot. For this, I want all samples that belong to the same subgroup to have the same color (so e.g. all "adk" samples should be green, all "bas" samples should be red and all "non" samples should be blue). I tried to use the color argument of the autoplot function from ggfortify, but I wansn't able to make it use my defined subsets.

I would be glad if someone could help me with this! Thanks :)

Edit: I'd like to give you an example of what I want to do, using the USArrests dataset:

head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

## Doing a PCA on the USArrests dataset

US.pca = prcomp(t(USArrests), center = F, scale = F)

## Now I can create a PCA biplot of PC1 and PC2 using the autoplot function (since I have ggfortify installed)

biplot1 = autoplot(US.pca,data=t(USArrests), x=1, y=2)

I want all samples that contain an "e" in their colname (in this case "Murder" and "Rape") to be the same color. The "UrbanPop" and the "Assault" sample should be an individual color as well. I hope this makes things a little clearer :)

P.S. I run R in the latest version of RStudio on Windows 10


Solution

  • Welcome to SO! What about something like this, using ggbiplot package:

    # PCA
    pca <- prcomp (t(expressions), center = F, scale= F)
    # first you get the vector of the names
    # gr <- substr(rownames(t(expressions)),1,3)
    # EDIT
    gr <-gsub(".*(adk|bas|non).*$", "\\1",rownames(t(expressions)), ignore.case = TRUE)
    
    library(ggbiplot)
    # plot it
    ggbiplot(pca, groups = gr)+ 
      scale_color_manual(values=c("green", "red"," blue")) + 
      theme_light()
    

    enter image description here

    EDIT
    If you're using R 4.0.0, you'd install the package following this two lines:

    library(devtools)
    install_github("vqv/ggbiplot", force = TRUE)
    

    With data:

    expressions <- read.table(    text = "adk1  adk2  bas1  bas2  bas3  non1 
                                   gene1   1.1   1.3   2.2   2.3   2.8   1.6
                                   gene2   2.5   2.3   4.1   4.6   4.2   1.9
                                   gene3   1.6   1.8   0.5   0.4   0.9   2.2", header = T )