R: Using data.frame information to colour points on a scatter plot

I have generated a scatter plot of my data using plot(data$pco$li[,1], data$pco$li[,2]). The result is a PCA scatter plot. I now want to colour each point according to its category: each point is a gene, and I want to colour it by the chromosome to which it belongs.

I have a file ready with genes in column one and chromosome in column two, and have loaded it into R using:

geneLoc <- read.table(file = "~/Location/File.txt", header = FALSE, sep = "\t")
colnames(geneLoc) <- c("Gene", "Chromosome")

From here I do not know how to use this information to colour the points on the scatter plot. The closest answer I found was in the question "Colouring scatter graph by type in r".

However, my data for the scatter is not in the form of a two column table (as it is the result of a package called Treescape that conducts PCA). It is therefore in this format:

          gene1    gene2    gene3    gene4    gene5    gene6    gene7    gene8    gene9
gene2  33.76389                                                                        
gene3  51.12729 47.74935                                                               
gene4  27.62245 31.38471 52.12485                                                      
gene5  33.92639 28.44293 53.74942 28.67054                                             
gene6  32.28002 26.57066 43.72642 29.54657 25.51470                                    
gene7  34.65545 30.08322 54.06478 30.59412 24.89980 27.00000                           
gene8  31.09662 27.44085 48.89785 27.49545 26.87006 24.59675 26.79552                  
gene9  36.20773 28.82707 50.94114 31.24100 24.53569 24.06242 25.41653 27.60435         
gene10 36.53765 28.75761 53.86093 30.46309 23.62202 25.00000 27.82086 28.87906 25.33772

As such, I can't simply add a third category column to a two-column data frame and use it to colour my scatter plot.


  • You need to convert your data into the following format:

    Var1      Var2      Value
    gene1     gene2     33.76389
    gene1     gene3     51.12729

    You can then easily append a 4th column. The reshape2 package has a function called melt that does this conversion. First, let's generate a matrix similar to your example above:

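    # Toy 9 x 9 matrix with the same row/column labels as your example
    mydata <- matrix(data=rnorm(81, 25, 10), ncol=9, nrow=9)
    colnames(mydata) <- paste0("gene", 1:9)
    rownames(mydata) <- paste0("gene", 2:10)
    # Blank out the upper triangle only; the diagonal (e.g. gene2 vs gene1) holds real values
    mydata[upper.tri(mydata)] <- NA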

    Now we can use reshape2 to turn this into the "long" format I described above:

    library(reshape2)
    meltdata <- melt(mydata, na.rm = TRUE)  # na.rm = TRUE drops the NA (upper-triangle) cells

    You can now append a column to the right of meltdata for plotting. The ggplot2 package works well with data structured in this long format.
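
As a rough sketch of the "append a column" step, one option is to look up each gene's chromosome by name with match(). Everything below is illustrative: geneLoc is a made-up table shaped like the one in the question, coords is a hypothetical stand-in for data$pco$li (assumed here to carry gene names as row names, which you should check on your own object), and base R's as.data.frame(as.table(...)) is used in place of melt only so the sketch runs without extra packages.

```r
# Toy stand-in for the distance matrix in the question (values are random).
set.seed(42)
mydata <- matrix(rnorm(81, 25, 10), nrow = 9, ncol = 9,
                 dimnames = list(paste0("gene", 2:10), paste0("gene", 1:9)))
mydata[upper.tri(mydata)] <- NA

# Long format using base R only; reshape2::melt(mydata, na.rm = TRUE)
# produces the same three columns (Var1, Var2, value).
meltdata <- as.data.frame(as.table(mydata), responseName = "value",
                          stringsAsFactors = FALSE)
meltdata <- meltdata[!is.na(meltdata$value), ]

# Hypothetical lookup table shaped like geneLoc in the question.
geneLoc <- data.frame(Gene = paste0("gene", 1:10),
                      Chromosome = rep(c("chr1", "chr2"), times = 5),
                      stringsAsFactors = FALSE)

# Append chromosome columns by matching gene names against the lookup table.
meltdata$Chr1 <- geneLoc$Chromosome[match(meltdata$Var1, geneLoc$Gene)]
meltdata$Chr2 <- geneLoc$Chromosome[match(meltdata$Var2, geneLoc$Gene)]

# The same lookup gives one colour per point for a base-R scatter plot.
# 'coords' is a made-up stand-in for data$pco$li (one row per gene).
coords <- data.frame(A1 = rnorm(10), A2 = rnorm(10), row.names = geneLoc$Gene)
chrom <- factor(geneLoc$Chromosome[match(rownames(coords), geneLoc$Gene)])
plot(coords$A1, coords$A2, col = as.integer(chrom), pch = 19)
legend("topright", legend = levels(chrom),
       col = seq_along(levels(chrom)), pch = 19)
```

Matching by name rather than by position means the order of rows in the lookup file does not have to agree with the order of the points. Genes missing from the lookup table come back as NA, so it is worth checking for those before plotting.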