
Split strings of different lengths and paste in specific column in a dataframe based on match


I have a character vector whose strings contain different numbers of taxonomic levels. It looks like the example below:

TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")

I need to build a data frame that places each part of every string in the column for its taxonomic rank: "domain", "phylum", "class", "order", "family", "genus".

I tried:

library(stringr)
taxon <- str_split(TX, "\\|", simplify = TRUE)

This splits the strings correctly, but it fills the columns from left to right, and I need each value placed in the column for its taxonomic level.

I believe I need to use grepl to match "d_", "p_", "c_", "o_", "f_", "g_", but I can't figure out how to write it correctly.

Thank you very much for the help.


Solution

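  • Using data.table: split on "|", reshape wide-to-long, split each value on "_" to get its taxonomic rank prefix, then reshape long-to-wide with one column per prefix. Note that the resulting columns are ordered alphabetically by prefix (c, d, f, g, o, p), not taxonomically: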

    library(data.table)
    
    TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")
    
    taxon <- data.table(x = TX)
    
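    # Split on "|" into columns, number the rows, melt to long format,
    # take the rank prefix (grp) from each value, then cast back to wide
    # so that every prefix gets its own column.
    taxon[, tstrsplit(x, "|", fixed = TRUE)
          ][, rn := seq_len(.N)
            ][, melt(.SD, id.vars = "rn")
              ][, c("grp", "name") := tstrsplit(value, "_", fixed = TRUE)
                ][!is.na(value), dcast(.SD, rn ~ grp, value.var = "value")]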
    #    rn                           c          d                     f                       g                  o               p
    # 1:  1                        <NA> d_Bacteria                  <NA>         g_Thermobaculum               <NA>            <NA>
    # 2:  2 c_Acidobacteria subdivision d_Bacteria f_Vicinamibacteraceae            g_Luteitalea               <NA> p_Acidobacteria
    # 3:  3            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae        g_Acidobacterium o_Acidobacteriales p_Acidobacteria
    # 4:  4            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae g_Candidatus Koribacter o_Acidobacteriales p_Acidobacteria
    # 5:  5            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae          g_Granulicella o_Acidobacteriales p_Acidobacteria
    # 6:  6            c_Acidobacteriia d_Bacteria   f_Acidobacteriaceae           g_Terriglobus o_Acidobacteriales p_Acidobacteria
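
  • If the columns should come out in taxonomic order with full rank names, a base R variant of the same idea (match each piece by its prefix) might look like the sketch below. The helper name split_by_rank and the prefix-to-rank mapping in ranks are assumptions made for illustration, and the input is shortened to three of the strings from the question:

```r
# Sample input from the question, shortened to three entries.
TX <- c("d_Bacteria|g_Thermobaculum",
        "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea",
        "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium")

# Assumed mapping from one-letter prefix to rank name, in taxonomic order.
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Hypothetical helper: put each "x_Name" piece in the column for prefix x.
split_by_rank <- function(x, ranks) {
  pieces <- strsplit(x, "|", fixed = TRUE)
  rows <- lapply(pieces, function(p) {
    prefix <- sub("_.*$", "", p)        # text before the first "_"
    p[match(names(ranks), prefix)]      # NA where a rank is absent
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- unname(ranks)
  out
}

taxon_df <- split_by_rank(TX, ranks)
```

    Here match() looks up each expected prefix among the pieces of one string, so a missing rank becomes NA instead of shifting the later values to the left. For the data.table result above, calling setcolorder() with c("rn", "d", "p", "c", "o", "f", "g") should give the same column ordering without changing the approach.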