Coloring Rarefaction curve lines by metadata (vegan package) (phyloseq package)

First time question asker here. I wasn't able to find an answer to this question in other posts (love stackexchange, btw).

Anyway... I'm creating a rarefaction curve via the vegan package and I'm getting a very messy plot that has a very thick black bar at the bottom of the plot which is obscuring some low diversity sample lines. Ideally, I would like to generate a plot with all of my lines (169; I could reduce this to 144) but make a composite graph, coloring by Sample Year and making different types of lines for each Pond (i.e: 2 sample years: 2016, 2017 and 3 ponds: 1,2,5). I've used phyloseq to create an object with all my data, then separated my OTU abundance table from my metadata into distinct objects (jt = OTU table and sampledata = metadata). My current code:

 jt <- as.data.frame(t(j)) # transform it to make it compatible with the proceeding commands
rarecurve(jt
          , step = 100
          , sample = 6000
          , main = "Alpha Rarefaction Curve"
          , cex = 0.2
          , color = sampledata$PondYear)

# A very small subset of the sample metadata
                  Pond    Year
F16.5.d.1.1.R2     5      2016
F17.1.D.6.1.R1     1      2017
F16.1.D15.1.R3     1      2016
F17.2.D00.1.R2     2      2017

enter image description here

Solution

Here is an example of how to plot a rarefaction curve with ggplot. I used data available in the phyloseq package available from bioconductor.

to install phyloseq:

source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
library(phyloseq)

other libraries needed

library(tidyverse)
library(vegan)

data:

mothlist <- system.file("extdata", "esophagus.fn.list.gz", package = "phyloseq")
mothgroup <- system.file("extdata", "esophagus.good.groups.gz", package = "phyloseq")
mothtree <- system.file("extdata", "esophagus.tree.gz", package = "phyloseq")
cutoff <- "0.10"
esophman <- import_mothur(mothlist, mothgroup, mothtree, cutoff)

extract OTU table, transpose and convert to data frame

otu <- otu_table(esophman)
otu <- as.data.frame(t(otu))
sample_names <- rownames(otu)

out <- rarecurve(otu, step = 5, sample = 6000, label = T)

Now you have a list each element corresponds to one sample:

Clean the list up a bit:

rare <- lapply(out, function(x){
  b <- as.data.frame(x)
  b <- data.frame(OTU = b[,1], raw.read = rownames(b))
  b$raw.read <- as.numeric(gsub("N", "",  b$raw.read))
  return(b)
})

label list

names(rare) <- sample_names

convert to data frame:

rare <- map_dfr(rare, function(x){
  z <- data.frame(x)
  return(z)
}, .id = "sample")

Lets see how it looks:

head(rare)
  sample       OTU raw.read
1      B  1.000000        1
2      B  5.977595        6
3      B 10.919090       11
4      B 15.826125       16
5      B 20.700279       21
6      B 25.543070       26

plot with ggplot2

ggplot(data = rare)+
  geom_line(aes(x = raw.read, y = OTU, color = sample))+
  scale_x_continuous(labels =  scales::scientific_format())

vegan plot:

rarecurve(otu, step = 5, sample = 6000, label = T) #low step size because of low abundance

One can make an additional column of groupings and color according to that.

Here is an example how to add another grouping. Lets assume you have a table of the form:

groupings <- data.frame(sample = c("B", "C", "D"),
                       location = c("one", "one", "two"), stringsAsFactors = F)
groupings
  sample location
1      B      one
2      C      one
3      D      two

where samples are grouped according to another feature. You could use lapply or map_dfr to go over groupings$sample and label rare$location.

rare <- map_dfr(groupings$sample, function(x){ #loop over samples
  z <- rare[rare$sample == x,] #subset rare according to sample 
  loc <- groupings$location[groupings$sample == x] #subset groupings according to sample, if more than one grouping repeat for all
  z <- data.frame(z, loc) #make a new data frame with the subsets
  return(z)
})

head(rare)
  sample       OTU raw.read loc
1      B  1.000000        1 one
2      B  5.977595        6 one
3      B 10.919090       11 one
4      B 15.826125       16 one
5      B 20.700279       21 one
6      B 25.543070       26 one

Lets make a decent plot out of this

ggplot(data = rare)+
  geom_line(aes(x = raw.read, y = OTU, group = sample, color = loc))+
  geom_text(data = rare %>% #here we need coordinates of the labels
              group_by(sample) %>% #first group by samples
              summarise(max_OTU = max(OTU), #find max OTU
                        max_raw = max(raw.read)), #find max raw read
              aes(x = max_raw, y = max_OTU, label = sample), check_overlap = T, hjust = 0)+
  scale_x_continuous(labels =  scales::scientific_format())+
  theme_bw()