
Splitting the data.frame into 2 columns


I have a FASTA file and read it into R using "read.delim". The resulting data.frame looks like the following:

>tm_sd_1256_2_1
MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFD
ASDJASDJ
>tm_sd_5672_1_2
AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGS
SADASDA[sample.fasta file][1]
>tm_sd_543_1_2
MUZTREQWERNBVXCYMNMVHZTOPOPOEURDASDOPOQWEUZQUIZRZIRIEIWUEWASDHASHDAHSDHAKHHSDHASHDJASHDAHUWIEUROWUOERUOWEUROOWWWW
>tm_sd_212_0_2
MTZTPSPASDASZDATSZGZASDZATSDASDARSDASDASDASDASDZTASZDTAXAYXFASTDRASRZWUEWERZWERZ

I would like to split this data.frame into two columns: one column for the names of the sequences and the other for the respective sequences.

I created a data.frame and stored the sequence names in one column, but when I tried to store the corresponding sequences in another column, it threw an error saying that the replacement has 55 rows and the data has 436 rows.

This is the code I tried, which produced the error:

new_DF=NULL
new_DF$names=as.data.frame(names(fasta_seq))
new_DF$sequences=as.data.frame(fasta_seq)

How can I achieve this in R? Kindly guide me.


Solution

  • Try

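    # Read the FASTA file as raw lines
    lines <- readLines('deena.fasta')
    # TRUE for header lines, i.e. those starting with ">"
    indx <- grepl('^>', lines)
    # cumsum(indx) numbers the records; paste each record's sequence lines together
    Sequence <- tapply(seq_along(indx), cumsum(indx), FUN=function(x)
                paste(lines[tail(x,-1)], collapse=""))
    d1 <- data.frame(names=lines[indx], Sequence, stringsAsFactors=FALSE)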
    head(d1,2)
    #             names
    # 1 >tm_sd_1256_2_1
    # 2 >tm_sd_5672_1_2
    #                                                                                                      Sequence
    # 1                                              MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFDASDJASDJ
    # 2 AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGSSADASDA[sample.fasta file][1]
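
To make the grouping step easier to follow, here is a minimal self-contained sketch of the same idea run on two records copied from the question. It is an illustration rather than the code from the answer: it uses split() and vapply() instead of tapply(), strips the leading ">" from the names, and the variable names (fasta_text, is_header, record_id, fasta_df) are made up for this example. It assumes every record starts with a header line beginning with ">".

```r
# Two records copied from the question; for a real file you would use
# fasta_text <- readLines("your_file.fasta")  (hypothetical file name)
fasta_text <- c(
  ">tm_sd_1256_2_1",
  "MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFD",
  "ASDJASDJ",
  ">tm_sd_212_0_2",
  "MTZTPSPASDASZDATSZGZASDZATSDASDARSDASDASDASDASDZTASZDTAXAYXFASTDRASRZWUEWERZWERZ"
)

# Header lines start with ">"
is_header <- grepl("^>", fasta_text)

# cumsum() over the logical vector gives every line a record id:
# it increases by one at each header, so here it is 1 1 1 2 2
record_id <- cumsum(is_header)

# Group the non-header lines by record id. Using a factor with one level per
# header keeps records that have no sequence lines (they become "").
groups <- split(
  fasta_text[!is_header],
  factor(record_id[!is_header], levels = seq_len(sum(is_header)))
)

# Collapse each record's lines into a single string
sequences <- vapply(groups, paste, character(1), collapse = "")

fasta_df <- data.frame(
  names = sub("^>", "", fasta_text[is_header]),  # drop the leading ">"
  sequences = unname(sequences),
  stringsAsFactors = FALSE
)
```

The result has one row per record, so the names and sequences columns always have the same length, which avoids the "replacement has 55 rows" mismatch described in the question.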