I have the following data frame:
df <- data.frame(V1 = c(">A1_[Er]",
"aaaabbbcccc",
">B2_[Br]",
"ddddeeeeeff",
">C3_[Gh]",
"ggggggghhhhhiiiiijjjjjj"))
I want to split each string into pieces of a fixed number of characters (two, for the purpose of this question) and place the pieces in new rows. The rows containing strings that start with ">" should not be split, but they should stay in place. The resulting data frame should look like this:
df1 <- data.frame(V1 = c(">A1_[Er]", "aa", "aa", "bb", "bc", "cc", "c",
">B2_[Br]", "dd", "dd", "ee", "ee", "ef", "f",
">C3_[Gh]", "gg", "gg", "gg", "gh", "hh", "hh", "ii", "ii", "ij", "jj", "jj", "j"))
I have tried using tidyr's separate_longer_position() function on a subsetted df like this:
separate_longer_position(subset(df, !df$V1 %like% ">"), V1, 2)
My approach did chop up the desired strings, but it also dropped the rows starting with ">" from the resulting data frame.
On a side note, this is indeed FASTA format, but for educational purposes I don't want to use dedicated packages like Biostrings to solve this.
Please advise.
You can try regmatches() together with gregexpr():
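df1 <-
  data.frame(V1 = with(
    df,
    unlist(
      lapply(
        V1,
        function(x) {
          if (startsWith(x, ">")) {
            # header line: keep it unchanged
            x
          } else {
            # sequence line: take consecutive runs of up to two word characters
            regmatches(x, gregexpr("\\w{1,2}", x))
          }
        }
      )
    )
  ))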
and obtain
> df1
V1
1 >A1_[Er]
2 aa
3 aa
4 bb
5 bc
6 cc
7 c
8 >B2_[Br]
9 dd
10 dd
11 ee
12 ee
13 ef
14 f
15 >C3_[Gh]
16 gg
17 gg
18 gg
19 gh
20 hh
21 hh
22 ii
23 ii
24 ij
25 jj
26 jj
27 j
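
As a possible alternative, here is a minimal base R sketch that cuts by position with substring() instead of a regular expression. The helper name split_fixed and the result name df2 are made up for this illustration, and the sketch assumes V1 is a character column (the default for data.frame() in R >= 4.0).

```r
df <- data.frame(V1 = c(">A1_[Er]", "aaaabbbcccc",
                        ">B2_[Br]", "ddddeeeeeff",
                        ">C3_[Gh]", "ggggggghhhhhiiiiijjjjjj"))

# Hypothetical helper: cut a single string into chunks of `width` characters.
split_fixed <- function(x, width = 2) {
  if (nchar(x) == 0) return(x)              # nothing to cut
  starts <- seq(1, nchar(x), by = width)    # 1, 3, 5, ...
  # substring() truncates at the end of the string, so the last chunk may be shorter
  substring(x, starts, starts + width - 1)
}

# Keep header lines (starting with ">") whole and split everything else.
pieces <- lapply(df$V1, function(x) {
  if (startsWith(x, ">")) x else split_fixed(x, width = 2)
})

df2 <- data.frame(V1 = unlist(pieces))
```

For the example data this should give the same rows as the regmatches() version. One difference to keep in mind: "\\w{1,2}" only matches word characters, so symbols such as "-" or "*" in a sequence line would be skipped by the regex, whereas the position-based cut keeps every character.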