Search code examples
pythonparsingbiopythonfasta

Parsing fasta file with biopython to count number sequence reads belonging to each ID


I am new to python programming and attempting to parse a fasta file and count the number of reads belonging to each ID in the file (working with toy example here but, plan to use for metagenomic seq files with 1-100,000 reads per subject). The output I would like to obtain would be a text file of something like:

Total reads: 8
Mean read length: 232.5
Median: 234.5
Mode: 250
Max: 250
Min: 209

Sample   Count
001-00   1
002-00   4
003-00   3
Etc.

Working through the examples provided in the biopython cookbook and other posts I have been able to cobble together the following code that will generate the descriptive statistics for the read lengths and give me the SampleID and read length for a single read, but I can't seem to get my head around how to best count the number of times each ID shows up in the file and format the output as above (with a tab separating the sample and tab fields). The code I have put together so far is:

from Bio import SeqIO
import statistics

records = list(SeqIO.parse("test_fasta.fasta", "fasta"))
print("Total reads: %i" % len(records))

sizes = [len(rec) for rec in SeqIO.parse("test_fasta.fasta", "fasta")]

print("Mean read length:", statistics.mean(sizes))
print("Median:", statistics.median(sizes))
print("Mode:", statistics.mode(sizes))
print("Max:", max(sizes))
print("Min:", min(sizes))
print()

generator = SeqIO.parse("test_fasta.fasta", "fasta")
print("Sample", "\t", "Read Length")
for seqrecord in generator:
    idkeep, rest = seqrecord.id.split(';',1)
    print(idkeep, "\t", len(seqrecord))

But this of course gives the read length and not the counts. Any thoughts on how to go about this would be much appreciated (as would any thoughts on blatant inefficiencies in what I have written thus far).


Solution

  • You could just have a dict id -> number of occurrences and update it while iterating through the records. Also, I think you could remove the:

    generator = SeqIO.parse("test_fasta.fasta","fasta")
    

    line, since you already have the records list. You can also change the line:

    sizes = [len(rec) for rec in SeqIO.parse("test_fasta.fasta", "fasta")]
    

    to:

    sizes = [len(rec) for rec in records]
    

    So you only parse the fasta file once.

    I would add to your code:

    occurrences_dict = {}
    for seqrecord in records:
        idkeep, rest = seqrecord.id.split(';',1)
        occurrences_dict[idkeep] = occurrences_dict.get(idkeep, 0) + 1
    

    Then you could print how many times each ID occurs:

    for k in occurrences_dict:
        print k, occurrences_dict[k]