r for-loop while-loop tidyr plyr

Adding data to a new column if separated by a "+" sign using R


Following on from my previous question, I have extra information in my data: I have included the gene with the data. Because the same gene was predicted as different enzymes, the results were combined with a "+" sign, but now I would like to split the results as shown below. My data frame looks like this:

df <-data.frame(Gene= c("A", "B", "C","D","E","F"),
                 G1=c("GH13_22+CBM4",  "GH109+PL7+GH9","GT57", "AA3","",""),
                 G2=c("GH13_22","","GT57+GH15","AA3", "GT41","PL+PL2"),
                 G3=c("GH13", "GH1O9","", "CBM34+GH13+CBM48", "GT41","GH16+CBM4+CBM54+CBM32"))

and the desired output looks like this:

df2<-data.frame(Gene= c("A","A","B", "B","B","C","C","D","D","D","E","F","F","F","F"),
                G1=c("GH13_22","CBM4","GH109","PL7","GH9","GT57","GT57","AA3","AA3","AA3","","","","",""),
                G2=c("GH13_22","GH13_22","","","","GT57","GH15","AA3","AA3","AA3", "GT41","PL","PL2","",""),
                G3=c("GH13","","GH1O9","GH1O9", "GH1O9","","","CBM34","GH13","CBM48", "GT41","GH16","CBM4","CBM54","CBM32"))

Kindly help


Solution

  • It was harder than I thought but here's a way.

    The main idea is to use str_split_fixed, which splits each string on a pattern and returns a fixed number of pieces, padding with "" when the input has fewer pieces than requested. Note: I chose 4 here because no cell in the example has more than four pieces, but you can pick a much higher upper bound to accommodate longer strings.

    library(stringr)
    df[-1] <- lapply(df[-1], \(x) asplit(str_split_fixed(x, "\\+", 4), 1))
    
    #  Gene                G1             G2                       G3
    #1    A GH13_22, CBM4, ,   GH13_22, , ,                GH13, , , 
    #2    B GH109, PL7, GH9,          , , ,               GH1O9, , , 
    #3    C        GT57, , ,  GT57, GH15, ,                    , , , 
    #4    D         AA3, , ,       AA3, , ,      CBM34, GH13, CBM48, 
    #5    E            , , ,      GT41, , ,                GT41, , , 
    #6    F            , , ,     PL, PL2, ,  GH16, CBM4, CBM54, CBM32
    

    This results in a data.frame in which G1:G3 are list-columns, i.e. each cell holds the four pieces taken from one row of the split matrix. The remaining code then unnests those cells into multiple rows in long format, replaces empty strings with NA, removes rows that contain only NAs, and finally fills the remaining missing values downward within each gene:

    library(dplyr)
    library(tidyr)
    
    unnest_longer(df, col = G1:G3) %>% 
      mutate(across(G1:G3, ~ na_if(.x, ""))) %>% 
      filter(if_any(G1:G3, complete.cases)) %>% 
      group_by(Gene) %>% 
      fill(G1:G3)
    
       Gene      G1      G2    G3
    1     A GH13_22 GH13_22  GH13
    2     A    CBM4 GH13_22  GH13
    3     B   GH109    <NA> GH1O9
    4     B     PL7    <NA> GH1O9
    5     B     GH9    <NA> GH1O9
    6     C    GT57    GT57  <NA>
    7     C    GT57    GH15  <NA>
    8     D     AA3     AA3 CBM34
    9     D     AA3     AA3  GH13
    10    D     AA3     AA3 CBM48
    11    E    <NA>    GT41  GT41
    12    F    <NA>      PL  GH16
    13    F    <NA>     PL2  CBM4
    14    F    <NA>     PL2 CBM54
    15    F    <NA>     PL2 CBM32
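
The upper bound of 4 in the splitting step was chosen by eye. As a minimal sketch, assuming stringr is installed and the columns are plain character vectors, the bound could instead be computed from the data by counting "+" signs. The names cols, n_max and pieces are illustrative and do not appear in the answer above; the Gene column is left out for brevity.

```r
library(stringr)

# Columns G1:G3 from the question (Gene omitted for brevity)
cols <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

# Most pieces in any cell = (largest number of "+" signs in a cell) + 1
n_max <- max(str_count(unlist(cols), fixed("+"))) + 1

# Same kind of split as in the answer, with n_max in place of the hard-coded 4
pieces <- str_split_fixed(cols$G1, fixed("+"), n_max)

dim(pieces)   # one row per gene, n_max columns
pieces[1, ]   # pieces of the first cell, padded with "" on the right
```

With this in place, n_max can be passed as the third argument of str_split_fixed in the lapply call, so cells with more pieces than expected are not silently lumped into the last column.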