Tags: r, sequence, bioinformatics, fasta

Getting information from FASTA annots in R


I have a large FASTA file that I need to analyse for a class.

I found help in the question "How to search and isolate attributes of FASTA formatted text in R", but I am still having trouble manipulating the data.
Using the function getAnnots(), I get a list of the "annots" in the following format:

>annots
[[i]]
[1] ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "

I want to convert this list into a data frame with one row per list element and four columns, one for each field shown in the example above.

I tried different combinations of strsplit() with sapply() and for loops, but to no avail.
Even using strsplit() on its own gives unsatisfactory results:

> strsplit(GISAnnots[[i]], split = " | ")
[[i]]
[1] ">sourceAccessionCode" "|" "StrainName" "|"
[5] "Type" "/" "Subtype" "|"
[9] "MoreInfo" "|"

And using a for loop gives the following result:

> info <- for (i in 1:length(GISAnnots))
+   strsplit(GISAnnots[[i]], split = " | ")
> info
NULL  

I apologise for not having a concrete example: I cannot think of one that shows the problem, and I can't share my own data due to copyright restrictions.

Thank you for your help


Solution

  • Here's some data

    elt = ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "
    lst = list(elt, elt)
    

    Probably the first problem is that this is a list, whereas you want a plain character vector, so unlist() it. A neat trick for data that is not too large is to pretend that the text is input to read.delim() or a similar function:

    > read.delim(text=unlist(lst), sep="|", header=FALSE, strip.white=TRUE)
                        V1         V2             V3        V4 V5
    1 >SourceAccessionCode StrainName type / subtupe OtherInfo NA
    2 >SourceAccessionCode StrainName type / subtupe OtherInfo NA
    

    You may also want to add stringsAsFactors=FALSE to the read.delim() call so that the columns stay character vectors rather than factors. The Biostrings package also provides readDNAStringSet() for working with FASTA files; the names of the FASTA sequences can then be retrieved with names(readDNAStringSet('your.fasta')).
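The strsplit() attempts in the question probably fail for two reasons: `split` is treated as a regular expression, so " | " means "a space or a space", and a for loop evaluates to NULL rather than collecting results. Below is a minimal sketch of that route, assuming every header has the same "|"-separated layout as the example. The `annots` list stands in for the output of getAnnots(), and the column names are hypothetical labels chosen for illustration.

```r
# Example annotation lines copied from the question (two identical entries).
elt <- ">SourceAccessionCode | StrainName | type / subtupe | OtherInfo | "
annots <- list(elt, elt)

# fixed = TRUE makes "|" a literal character instead of regex alternation.
# lapply() returns its results as a list, unlike a for loop.
parts <- lapply(annots, function(x) strsplit(x, split = "|", fixed = TRUE)[[1]])

# Trim surrounding whitespace and keep the first four fields; the fifth piece
# is only the blank left after the trailing "|".
parts <- lapply(parts, function(p) trimws(p)[1:4])

# Bind the per-record vectors together as rows of a data frame.
info <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Hypothetical column names, not taken from the original data.
names(info) <- c("accession", "strain", "type_subtype", "other")

# Drop the leading ">" FASTA header marker from the first column.
info$accession <- sub("^>", "", info$accession)

print(info)
```

For headers with a consistent layout this should give the same four columns as the read.delim() trick. It is mainly useful when you want control over each step, for example to split the "type / subtupe" field further.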