I have a fasta file and I read the fasta file using "read.delim" into R. The corresponding data.frame looks like following:
>tm_sd_1256_2_1
MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFD
ASDJASDJ
>tm_sd_5672_1_2
AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGS
SADASDA[sample.fasta file][1]
>tm_sd_543_1_2
MUZTREQWERNBVXCYMNMVHZTOPOPOEURDASDOPOQWEUZQUIZRZIRIEIWUEWASDHASHDAHSDHAKHHSDHASHDJASHDAHUWIEUROWUOERUOWEUROOWWWW
>tm_sd_212_0_2
MTZTPSPASDASZDATSZGZASDZATSDASDARSDASDASDASDASDZTASZDTAXAYXFASTDRASRZWUEWERZWERZ
I would like split this data.frame into two columns.One column for names of the sequence and the other column for the respective sequences.
I created a data.frame and stored the names of sequences in one column but when I tried store the corresponding sequences in another column, it throwed me an error saying that replacement has 55 rows and data has 436 rows.
The following code I tried and it gave me an error as follows:
new_DF=NULL
new_DF$names=as.data.frame(names(fasta_seq))
new_DF$sequences=as.data.frame(fasta_seq)
How can I achieve this using R. kindly guide me.
Try
lines <- readLines('deena.fasta')
indx <- grepl('>', lines)
Sequence <- tapply(seq_along(indx),cumsum(indx), FUN=function(x)
paste(lines[tail(x,-1)], collapse=""))
d1 <- data.frame(names=lines[indx], Sequence, stringsAsFactors=FALSE)
head(d1,2)
# names
#1 >tm_sd_1256_2_1
#2 >tm_sd_5672_1_2
# Sequence
# 1 MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFDASDJASDJ
# 2 AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGSSADASDA[sample.fasta file][1]