Search code examples
rphyloseq

combine OTU and tax table and replace actual sequences with OTU ids (Phyloseq/dada2)


I was following the workflow described here https://f1000research.com/articles/5-1492/v2 using the sample data as well as my own data. This worked fine, but now I can't generate an OTU table, that contains a header such as "OTU00004" or even better "kingdom_phylum_..._Pseudomonas_OTU00004". I would like to use such a table to find and plot the abundance of a certain OTU over several samples.

I created an object called ps, that seems to be ok:

ps <- phyloseq(tax_table(taxtab), sample_data(samdf),
                 otu_table(seqtab, taxa_are_rows = FALSE),phy_tree(fitGTR$tree))    

> ps
    phyloseq-class experiment-level object
    otu_table()   OTU Table:         [ 454 taxa and 360 samples ]
    sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
    tax_table()   Taxonomy Table:    [ 454 taxa by 6 taxonomic ranks ]
    phy_tree()    Phylogenetic Tree: [ 454 tips and 452 internal nodes ]

but the headers in the OTU table and the corresponding rows in the taxonomy table are actual (here shortened) sequences

> head(otu_table(ps)[1])
     GCAAGCGTTACTCGGAATCACTGGGCGTAAAGAGCGCGTAGGCGG#shortened
F3D0                                             0

> head(tax_table(ps)[1])
Taxonomy Table:     [1 taxa by 6 taxonomic ranks]:
                                                         Kingdom
GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGA#shortened "Bacteria"

Is there a way to combine the information from otu table and taxonomy table and replace the sequences with numbered OTU ids? I have checked several phyloseq resources and FAQs, but I can't find an answer to this.

I would like to have a table looking like this:

        taxonomy_OTU00001   taxonomy_OTU00002   taxonomy_OTU00003
F3D0    #counts             #counts             #counts
F3D1    #counts             #counts             #counts
F3D11   #counts             #counts             #counts
F3D125  #counts             #counts             #counts

As the workflow until this step is quite time consuming, I'm not sure how to provide a reproducible example for this problem.

EDIT: I generated a sample subset following dww's suggestion.

short_otu2 = short_otu = head(otu_table(ps)[,c(1:6)])  # seq as colnames 
short_tax2 = short_tax = tax_table(ps)[colnames(short_otu), ]  # seq as rownames
# shorten seqs, must still be unique
colnames(short_otu2) <- substr(colnames(short_otu), 0, 50)
rownames(short_tax2) <- substr(rownames(short_tax), 0, 50)

library(phyloseq)
> dput(short_otu2)
new("otu_table", .Data = structure(c(526L, 375L, 2931L, 994L,
2061L, 419L, 319L, 330L, 1737L, 623L, 1868L, 350L, 402L, 207L,
1880L, 577L, 887L, 303L, 413L, 64L, 838L, 698L, 939L, 484L, 146L,
126L, 496L, 440L, 1183L, 184L, 462L, 37L, 26L, 782L, 271L, 310L
), .Dim = c(6L, 6L), .Dimnames = list(c("F3D0", "F3D1", "F3D11",
"F3D125", "F3D13", "F3D141"), c("GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGAT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCT", "GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTT", "CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGC"))), taxa_are_rows = FALSE)

> dput(short_tax2)
new("taxonomyTable", .Data = structure(c("Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteroidetes",
"Bacteroidetes", "Bacteroidetes", "Bacteroidetes", "Bacteroidetes",
"Bacteroidetes", "Bacteroidia", "Bacteroidia", "Bacteroidia",
"Bacteroidia", "Bacteroidia", "Bacteroidia", "Bacteroidales",
"Bacteroidales", "Bacteroidales", "Bacteroidales", "Bacteroidales",
"Bacteroidales", "Bacteroidales_S24-7_group", "Bacteroidales_S24-7_group",
"Bacteroidales_S24-7_group", "Bacteroidales_S24-7_group", "Bacteroidaceae",
"Bacteroidales_S24-7_group", NA, NA, NA, NA, "Bacteroides", NA
), .Dim = c(6L, 6L), .Dimnames = list(c("GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGAT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCT", "GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTT", "CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGC"), c("Kingdom",
"Phylum", "Class", "Order", "Family", "Genus"))))

Solution

  • This is to the part of the question "replace actual sequences with OTU ids (Phyloseq/dada2)?"

    I contacted the phyloseq/dada2 developers and based on Susan Holmes' reply (https://github.com/joey711/phyloseq/issues/1030) I came up with this piece of code to replace the amplicon sequences with a numbered OTU header.

    Further discussion can be found here: https://github.com/joey711/phyloseq/issues/213

    # this changes the header from the actual sequence to Seq_001, Seq_002 etc
    taxa_names(ps)
    n_seqs <- seq(ntaxa(ps))
    len_n_seqs <- nchar(max(n_seqs))
    taxa_names(ps) <- paste("Seq", formatC(n_seqs, 
                                                width = len_n_seqs, 
                                                flag = "0"), sep = "_")
    taxa_names(ps)
    

    A possible way to get taxonomy included into the header is the following (continuing from above):

    # generate a vector containing the full taxonomy path for all OTUs
    wholetax <- do.call(paste, c(as.data.frame(tax_table(ps))
                      [c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")], 
                      sep = "__"))  # to distinguish from "_" within tax ranks
    
    # turn the otu_table into a data.frame
    otu_export <- as.data.frame(otu_table(ps))
    tmp <- names(otu_export)
    
    # paste wholetax and OTU_ids together
    for(i in 1:length(tmp)){
    names(tmp)[i] = paste(wholetax[i], tmp[i], sep = "__")
    }
    
    # overwrite old names
    names(otu_export) <- names(tmp)
    
    > head(otu_export)[5]
    
    # output:  
         Bacteria__Bacteroidetes__Bacteroidia__Bacteroidales__Bacteroidaceae__Bacteroides__Seq_005
    F3D0                                                                                         146
    F3D1                                                                                         126
    F3D11                                                                                        496
    F3D125                                                                                       440
    F3D13                                                                                       1183
    F3D141                                                                                       184
    

    This yet does not include a test of correct sorting between the tables! So make sure the pasting and overwriting is correct.

    This way you have a data.frame containing taxonomy "splittable" for each taxonomic rank, the OTU id, the sample name and the counts in one file. But apart from the export file, you still maintain the phyloseq-structure, where the OTU_ids link the different tables such as otu_table() and tax_table(). Another way would be to provide the wholetax vector to the taxa_names() command, I haven't tested that.

    Suggestions for improvements are highly welcomed!