
Split string of characters contained in a row of a data frame by a fixed number of characters and store the resultant fragments in subsequent rows


I have the following data frame:

df <- data.frame(V1 = c(">A1_[Er]", 
                        "aaaabbbcccc", 
                        ">B2_[Br]", 
                        "ddddeeeeeff", 
                        ">C3_[Gh]", 
                        "ggggggghhhhhiiiiijjjjjj"))

I want to split the strings into pieces of a fixed number of characters (two, for the purposes of this question) and place each piece in a new row. The rows containing strings that start with ">" should be kept as they are, without splitting. The resulting data frame should look like this:

df1 <- data.frame(V1 = c(">A1_[Er]", "aa", "aa", "bb", "bc", "cc", "c", 
                         ">B2_[Br]", "dd", "dd", "ee", "ee", "ef", "f",
                         ">C3_[Gh]", "gg", "gg", "gg", "gh", "hh", "hh", "ii", "ii", "ij", "jj", "jj", "j"))

I have tried using the separate_longer_position() function on a subsetted df like this:

separate_longer_position(subset(df, !df$V1 %like% ">"), V1, 2)

My approach did chop up the desired strings, but it also dropped the rows starting with ">" from the resulting data frame.

On a side note, this is indeed the FASTA format, but for educational purposes I don't want to use dedicated packages like Biostrings to solve this.

Please advise.


Solution

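  • You can try regmatches() together with gregexpr(): strings starting with ">" are returned unchanged, and every other string is cut into chunks of at most two characters. Note that "\\w" matches only letters, digits and underscores, which is enough for the data shown here.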

    df1 <-
      data.frame(V1 = with(
        df,
        unlist(
          lapply(
            V1,
            function(x) {
              if (startsWith(x, ">")) {
                # header line: keep it as is
                x
              } else {
                # sequence line: greedily match chunks of up to two word characters
                regmatches(x, gregexpr("\\w{1,2}", x))
              }
            }
          )
        )
      ))
    

    and obtain

    > df1
             V1
    1  >A1_[Er]
    2        aa
    3        aa
    4        bb
    5        bc
    6        cc
    7         c
    8  >B2_[Br]
    9        dd
    10       dd
    11       ee
    12       ee
    13       ef
    14        f
    15 >C3_[Gh]
    16       gg
    17       gg
    18       gg
    19       gh
    20       hh
    21       hh
    22       ii
    23       ii
    24       ij
    25       jj
    26       jj
    27        j
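
If the sequence lines might contain characters that "\\w" does not match, a regex-free variant of the same idea could be sketched with base R's substring(). This is only a sketch under stated assumptions: split_fixed is a hypothetical helper name (not part of any package), the data frame is a shortened copy of the one in the question, and stringsAsFactors = FALSE is set so that V1 is a character column on R versions before 4.0 as well.

```r
# Hypothetical helper (not from any package): cut one string into
# consecutive pieces of `width` characters; the last piece may be shorter.
split_fixed <- function(x, width = 2) {
  if (nchar(x) == 0) return(x)              # nothing to split
  starts <- seq(1, nchar(x), by = width)    # start position of each piece
  substring(x, starts, starts + width - 1)  # substring() truncates at the end
}

# Shortened copy of the question's data (assumption: V1 is character)
df <- data.frame(V1 = c(">A1_[Er]", "aaaabbbcccc",
                        ">B2_[Br]", "ddddeeeeeff"),
                 stringsAsFactors = FALSE)

# Keep header lines as they are, split everything else
pieces <- lapply(df$V1, function(x) {
  if (startsWith(x, ">")) x else split_fixed(x, 2)
})

df1 <- data.frame(V1 = unlist(pieces), stringsAsFactors = FALSE)
df1
```

Because the pieces are taken by position rather than by pattern, any character is kept, and the chunk size can be changed through the width argument.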