How to split a FASTA file imported as a data.frame on ">"

I have imported a FASTA file into R as a single-column data frame that looks like this:

dna.sequences <- data.frame(c(">ID1", "sequence1", ">ID2" , "sequence2", ...))

I want to split this data frame into two columns and remove the '>' before every ID, so that I end up with something like this:

    new_dna <- data.frame(
      ID = c("ID1", "ID2" ... ),
      sequence = c("sequence1", "sequence2" ... )
    )

Thanks in advance, Jose

Solution

Let's say your file is like this:

writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
dna.sequences = read.table("test.fa")

dna.sequences
    V1
1 >ID1
2 GAGA
3 >ID2
4 TATA

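Assuming it's read correctly, with each record taking exactly two rows (a header row followed by a single-line sequence), take the odd rows as IDs and the even rows as sequences: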

rows = seq_len(nrow(dna.sequences))
# odd rows hold the ">ID" headers, even rows hold the sequences
data.frame(ID = sub("^>", "", as.character(dna.sequences[rows %% 2 == 1, 1])),
           sequences = as.character(dna.sequences[rows %% 2 == 0, 1]))
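
If the sequences wrap over several lines, the odd/even indexing above no longer lines up. Here is a minimal base-R sketch for that case. It assumes the file starts with a header line and that every header is followed by at least one sequence line. The temporary file and the variable names (fa, is_header, record, new_dna) are made up for illustration:

```r
# Hypothetical example file: the first record's sequence spans two lines
fa <- tempfile(fileext = ".fa")
writeLines(c(">ID1", "GAGA", "CCTT", ">ID2", "TATA"), fa)

lines <- readLines(fa)
lines <- lines[nzchar(lines)]          # drop blank lines
is_header <- startsWith(lines, ">")    # TRUE on ">ID" lines
record <- cumsum(is_header)            # record number for every line

# Paste together the sequence lines that belong to the same record
seqs <- vapply(split(lines[!is_header], record[!is_header]),
               paste, character(1), collapse = "")

new_dna <- data.frame(
  ID = sub("^>", "", lines[is_header]),
  sequence = unname(seqs),
  stringsAsFactors = FALSE
)
new_dna
```

Grouping on cumsum(is_header) means the number of lines per record does not matter, which the odd/even approach cannot handle.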

Or much better, read it in directly using a package meant for this:

library(Biostrings)
data = readDNAStringSet("test.fa")

data
  A DNAStringSet instance of length 2
    width seq                                               names               
[1]     4 GAGA                                              ID1
[2]     4 TATA                                              ID2

dna.sequences = data.frame(ID=names(data),sequences=as.character(data))

dna.sequences
     ID sequences
ID1 ID1      GAGA
ID2 ID2      TATA
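
If the header lines carry a description after the ID (for example ">ID1 some description"), the names will most likely contain that whole string. A small sketch for trimming them down to the ID, using a made-up character vector (headers) in place of names(data) and assuming the ID never contains whitespace:

```r
# Hypothetical header names, standing in for names(data)
headers <- c("ID1 first description", "ID2 second description")

# Keep only the text before the first whitespace character
ids <- sub("[[:space:]].*$", "", headers)
ids
```

The same sub() call could be applied to names(data) before building the data frame.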