
How to split a FASTA file imported as a data.frame on ">"


I have imported a FASTA file into R as a single-column data frame that looks like this:

dna.sequences <- data.frame(c(">ID1", "sequence1", ">ID2", "sequence2", ...))

I want to split this data frame into two columns and remove the '>' before every ID, so that I end up with something like this:

    new_dna <- data.frame(
      ID = c("ID1", "ID2", ...),
      sequence = c("sequence1", "sequence2", ...)
    )

Thanks in advance, Jose


Solution

  • Let's say your file is like this:

    writeLines(">ID1\nGAGA\n>ID2\nTATA","test.fa")
    dna.sequences = read.table("test.fa")
    
    dna.sequences
        V1
    1 >ID1
    2 GAGA
    3 >ID2
    4 TATA
    

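    Assuming it's read correctly and every sequence sits on a single line, so that headers land on odd rows and sequences on even rows: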

    rows = seq_len(nrow(dna.sequences))
    # Odd rows hold the headers (strip the leading ">"), even rows hold the sequences
    data.frame(ID = sub("^>", "", as.character(dna.sequences[rows %% 2 == 1, 1])),
               sequences = dna.sequences[rows %% 2 == 0, 1])
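
The odd/even indexing above only works when each record is exactly two lines. Many FASTA files wrap long sequences over several lines, and in that case a base R approach could look roughly like the sketch below. It assumes the file starts with a header line and that every record has at least one sequence line. The file contents and the names `fasta_path`, `fasta_lines`, `is_header` and `record` are made up for illustration.

```r
# Hypothetical example file: the second record is wrapped across two lines.
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">ID1", "GAGA", ">ID2", "TATA", "CCGG"), fasta_path)

# Read the raw lines and drop empty ones.
fasta_lines <- readLines(fasta_path)
fasta_lines <- fasta_lines[nzchar(fasta_lines)]

# Header lines start with ">"; cumsum() tags every line with its record number.
is_header <- startsWith(fasta_lines, ">")
record <- cumsum(is_header)

# Paste together the sequence lines that belong to the same record.
sequences <- vapply(
  split(fasta_lines[!is_header], record[!is_header]),
  paste, character(1), collapse = ""
)

new_dna <- data.frame(
  ID = sub("^>", "", fasta_lines[is_header]),  # strip only the leading ">"
  sequence = unname(sequences),
  stringsAsFactors = FALSE
)
```

Using `readLines()` rather than `read.table()` keeps each line as a plain string, so nothing in a header is split on whitespace.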
    

    Or much better, read it in directly using a package meant for this:

    library(Biostrings)
    data = readDNAStringSet("test.fa")
    
    data
      A DNAStringSet instance of length 2
        width seq                                               names               
    [1]     4 GAGA                                              ID1
    [2]     4 TATA                                              ID2
    
    dna.sequences = data.frame(ID=names(data),sequences=as.character(data))
    
    dna.sequences
         ID sequences
    ID1 ID1      GAGA
    ID2 ID2      TATA