bioinformatics biopython fasta bioconductor bioperl

multiFASTA file processing

I was curious to know if there is any bioinformatics tool out there that can process a multiFASTA file and give me information such as the number of sequences, their lengths, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anything.

Can you help me? Thanks a lot :-)

Solution

EMBOSS is a collection of small tools, and some of them can help you out.

seqstats returns sequence length
pepstats should give you amino acid content etc.

Some of the tools also offer plotting functions. Very handy. http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html
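If you would rather not install anything, the basic numbers asked for (sequence count, lengths, residue content) can also be computed with a few lines of plain Python. This is only a rough sketch, not a replacement for the EMBOSS tools: the function names `read_fasta` and `fasta_stats` are made up for illustration, and it assumes a well-formed FASTA file where every record starts with a `>` header line.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_stats(handle):
    """Collect sequence count, per-sequence lengths and overall residue counts."""
    lengths = []
    composition = Counter()
    for _header, seq in read_fasta(handle):
        lengths.append(len(seq))
        composition.update(seq.upper())  # count residues case-insensitively
    return {
        "n_sequences": len(lengths),
        "lengths": lengths,
        "total_length": sum(lengths),
        "composition": dict(composition),
    }


if __name__ == "__main__":
    # Small in-memory example; for a real file use open("mySequences.fasta").
    example = io.StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGCC\n")
    print(fasta_stats(example))
```

The `lengths` list and the `composition` dictionary can be passed to whatever plotting library you prefer to get histograms or bar charts.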

To count the number of FASTA entries, I use: grep -c '^>' mySequences.fasta

To make sure none of the entries are duplicated, I check that I get the same number when doing this (note that it only compares the header lines, not the sequences): grep '^>' mySequences.fasta | sort | uniq | wc -l
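The grep check tells you that duplicates exist, but not which headers are repeated. A small Python sketch along the same lines could list them. The function name `duplicate_headers` is hypothetical, and like the grep pipeline it compares whole header lines only, so identical sequences stored under different names are not detected.

```python
from collections import Counter
import io


def duplicate_headers(handle):
    """Return (number of entries, {header line: count} for repeated headers)."""
    # Count every line that starts with '>' exactly as grep '^>' would match it.
    counts = Counter(
        line.rstrip("\r\n") for line in handle if line.startswith(">")
    )
    n_entries = sum(counts.values())
    repeated = {h: c for h, c in counts.items() if c > 1}
    return n_entries, repeated


if __name__ == "__main__":
    # Small in-memory example; for a real file use open("mySequences.fasta").
    example = io.StringIO(">a\nAC\n>b\nGG\n>a\nTT\n")
    n_entries, repeated = duplicate_headers(example)
    print(n_entries, repeated)
```

An empty dictionary means every header line is unique, which corresponds to the two grep counts being equal.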