I have a large FASTA file that I need to analyse for a class.
I found help in the question "How to search and isolate attributes of FASTA formatted text in R". However, I am still having trouble manipulating the data.
Using the function getAnnots(), I get a list of the "annots" in the following format:
>annots
[[i]]
[1] ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "
I want to change this list format into a data frame where each element of the list, each on a separate row, is split into four columns (each containing the information in the example above).
I tried different combinations of the strsplit() function with sapply() and for loops, but to no avail.
Even using strsplit() on its own gives unsatisfactory results:
> strsplit(GISAnnots[[i]], split = " | ")
[[i]]
[1] ">sourceAccessionCode" "|" "StrainName" "|"
[5] "Type" "/" "Subtype" "|"
[9] "MoreInfo" "|"
And using for loops gives the following results:
> info <- for (i in 1:length(GISAnnots))
+ strsplit(GISAnnots[[i]], split = " | ")
> info
NULL
I apologise for not having a concrete example: I cannot think of one that shows the problem, and I can't use my own data due to copyright restrictions.
Thank you for your help
Here's some data
elt = ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "
lst = list(elt, elt)
Probably the first problem is that this is a list, but you'd like it to be unlisted. A neat trick for not-too-large data is to pretend that the text is input to read.delim() or similar:
> read.delim(text=unlist(lst), sep="|", header=FALSE, strip.white=TRUE)
V1 V2 V3 V4 V5
1 >SourceAccessionCode StrainName type / subtupe OtherInfo NA
2 >SourceAccessionCode StrainName type / subtupe OtherInfo NA
Maybe add stringsAsFactors=FALSE so the columns stay as character vectors (this is already the default from R 4.0.0 onwards). The Biostrings package also has readDNAStringSet() for working with FASTA files; the names of the FASTA sequences can then be retrieved with names(readDNAStringSet('your.fasta')).
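If you would rather stay with strsplit(), here is a minimal base R sketch. The reason split = " | " splits on every space is that the pattern is treated as a regular expression, where "|" means "or"; fixed = TRUE makes it a literal string instead. The for loop returned NULL because for itself always returns NULL, and strsplit() is vectorised, so no loop is needed. The column names below are my own invention and purely illustrative.

```r
# Example data mirroring the header format from the question
elt <- ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "
lst <- list(elt, elt)

# fixed = TRUE matches " | " literally instead of as a regular expression.
# strsplit() is vectorised over the character vector, so no loop is required.
parts <- strsplit(unlist(lst), split = " | ", fixed = TRUE)

# Each element of `parts` holds the four fields of one header;
# bind them row-wise into a matrix, then convert to a data frame.
info <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Hypothetical column names, chosen only for this example
names(info) <- c("accession", "strain", "type_subtype", "other")

# Drop the leading ">" that marks a FASTA header line
info$accession <- sub("^>", "", info$accession)
```

This assumes every header has the same number of fields; if some do not, rbind() will recycle values, so it is worth checking lengths(parts) first.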