I have imported a FASTA file into R as a single-column data frame that looks like this:
dna.sequences <- data.frame(c(">ID1", "sequence1", ">ID2" , "sequence2", ...))
I want to split this data frame into two columns and remove the '>' before every ID, so that I end up with something like this:
new_dna <- data.frame(
ID = c("ID1", "ID2" ... ),
sequence = c("sequence1", "sequence2" ... )
)
Thanks in advance, Jose
Let's say your file is like this:
writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")
dna.sequences
V1
1 >ID1
2 GAGA
3 >ID2
4 TATA
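Assuming it's read correctly, and that every record is exactly two lines (a header line followed by a single sequence line), the odd rows are the IDs and the even rows are the sequences: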
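rows = seq_len(nrow(dna.sequences))
# odd rows hold the headers, even rows hold the sequences
data.frame(ID = sub("^>", "", as.character(dna.sequences[rows %% 2 == 1, 1])),  # strip only the leading '>'
           sequences = as.character(dna.sequences[rows %% 2 == 0, 1]))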
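The odd/even trick breaks as soon as a sequence is wrapped over several lines, which is common in FASTA files. Here is a rough base-R sketch for that case. The file name test_multiline.fa and the variable names are made up for illustration, and it assumes every header is followed by at least one sequence line:

```r
# Hypothetical example file: the first sequence is wrapped over two lines
writeLines(c(">ID1", "GAGA", "CCTT", ">ID2", "TATA"), "test_multiline.fa")

lines <- readLines("test_multiline.fa")
lines <- lines[nzchar(lines)]          # drop blank lines
is_header <- startsWith(lines, ">")    # header lines start with '>'
record <- cumsum(is_header)            # record number for every line

# Paste together all sequence lines that belong to the same record
seqs <- vapply(split(lines[!is_header], record[!is_header]),
               paste, character(1), collapse = "")

new_dna <- data.frame(
  ID = sub("^>", "", lines[is_header]),  # strip only the leading '>'
  sequence = unname(seqs),
  stringsAsFactors = FALSE
)
new_dna
```

If a header has no sequence lines after it, the two columns would no longer line up, so check your file first if that can happen.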
Or much better, read it in directly using a package meant for this:
library(Biostrings)
data = readDNAStringSet("test.fa")
data
A DNAStringSet instance of length 2
width seq names
[1] 4 GAGA ID1
[2] 4 TATA ID2
dna.sequences = data.frame(ID=names(data),sequences=as.character(data))
dna.sequences
ID sequences
ID1 ID1 GAGA
ID2 ID2 TATA
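The row names in that result just repeat the ID column, because as.character(data) returns a named vector. If you'd rather have plain numbered rows, something like the following should work. To keep the sketch runnable without Biostrings, seqs is a hand-made named character vector standing in for as.character(data), and the length column is only an illustration:

```r
# Stand-in for as.character(data): a named character vector (assumption for illustration)
seqs <- c(ID1 = "GAGA", ID2 = "TATA")

# row.names = NULL stops data.frame() from reusing the vector names as row names
tidy <- data.frame(ID = names(seqs), sequences = seqs,
                   row.names = NULL, stringsAsFactors = FALSE)
tidy$length <- nchar(tidy$sequences)   # sequence length per record
tidy
```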