I have a character vector whose strings contain different numbers of fields. It looks like the example below:
TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")
I need to build a data frame that splits each string into columns by taxonomic rank: "domain", "phylum", "class", "order", "family", "genus".
I tried:
library(stringr)
taxon <- str_split(TX, "\\|", simplify = TRUE)
This splits the strings perfectly, but it fills the columns from left to right, and I need each value placed in the column for its taxonomic level.
I believe I need to use grepl to match the prefixes "d_", "p_", "c_", "o_", "f_", "g_", but I cannot figure out how to write it correctly.
Thank you very much for the help.
Using data.table: split on "|", reshape wide-to-long, split on "_" to get the taxonomic annotation group, then reshape long-to-wide:
library(data.table)
TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")
taxon <- data.table(x = TX)
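taxon[, tstrsplit(x, "|", fixed = TRUE)          # one column per "|"-separated field
][, rn := seq_len(.N)                            # row id, used to reassemble rows later
][, melt(.SD, id.vars = "rn")                    # wide-to-long: one row per field
][, c("grp", "name") := tstrsplit(value, "_")    # prefix (d, p, c, ...) and taxon name
][!is.na(value), dcast(.SD, rn ~ grp, value.var = "value")]  # long-to-wide, one column per prefix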
#    rn                           c          d                     f                       g                  o               p
# 1:  1                        <NA> d_Bacteria                  <NA>         g_Thermobaculum               <NA>            <NA>
# 2:  2 c_Acidobacteria subdivision d_Bacteria f_Vicinamibacteraceae            g_Luteitalea               <NA> p_Acidobacteria
# 3:  3            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae        g_Acidobacterium o_Acidobacteriales p_Acidobacteria
# 4:  4            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae g_Candidatus Koribacter o_Acidobacteriales p_Acidobacteria
# 5:  5            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae          g_Granulicella o_Acidobacteriales p_Acidobacteria
# 6:  6            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae           g_Terriglobus o_Acidobacteriales p_Acidobacteria
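The columns above come out in alphabetical order of the prefix (c, d, f, g, o, p), not in taxonomic order. If you want them ordered and named as in the question ("domain", "phylum", ...), one possible variation is to turn the prefix into a factor before casting. This is only a sketch, not the chain above: the `ranks` lookup and the `long`/`wide` names are made up for illustration, it uses just two of the strings from `TX`, and it stores the name without its prefix. A rank that never occurs in the input gets no column here.

```r
library(data.table)

# Hypothetical lookup: prefixes in taxonomic order, mapped to the
# column names requested in the question.
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Two of the strings from the question, to keep the example short.
TX <- c("d_Bacteria|g_Thermobaculum",
        "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter")

# One row per "|"-separated field, keeping the original row number.
long <- data.table(rn = seq_along(TX), x = TX)[
  , .(value = unlist(strsplit(x, "|", fixed = TRUE))), by = rn]

# The prefix is the text before the first "_"; as a factor, its level
# order determines the column order after dcast().
long[, grp := factor(sub("_.*", "", value), levels = names(ranks), labels = ranks)]

# The taxon name is everything after the first "_".
long[, name := sub("^[^_]*_", "", value)]

wide <- dcast(long, rn ~ grp, value.var = "name")
print(wide)
```

Ranks missing from a string (for example, everything between domain and genus in the first one) end up as NA in that row.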