
Python dict and for loop with a FASTA file


I was given a FASTA-formatted file (like those from this site: http://www.uniprot.org/proteomes/) containing the protein-coding sequences of a certain bacterium. I have been asked to give the full count and the relative percentage of each single-letter amino acid code in the file, and to return the results like this:

L: 139002 (10.7%) 

A: 123885 (9.6%) 

G: 95475 (7.4%) 

V: 91683 (7.1%) 

I: 77836 (6.0%)

What I have so far:

#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

for key in counts:
    print key, counts[key]

I believe this retrieves every capital letter in the file, not just those in the amino acid sequences. How can I limit it to the coding sequences? I am also having trouble working out how to calculate each single-letter code's share of the total.


Solution

  • The only lines that don't contain what you want start with >, so just ignore those:

    from collections import defaultdict

    # the 20 standard single-letter amino acid codes
    amino_acids = set("ACDEFGHIKLMNPQRSTVWY")
    counts = defaultdict(int)
    with open("input.fasta") as ecoli:  # will close your file automatically
        for line in ecoli:  # iterate over the file object, no need to read all contents into memory
            if line.startswith(">"):  # skip header lines that start with >
                continue
            for char in line:  # just iterate over the characters in the line
                if char in amino_acids:
                    counts[char] += 1
    total = float(sum(counts.values()))  # float so the division also works on Python 2
    # print the most frequent residue first, in the format "L: 139002 (10.7%)"
    for key, val in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        print("{}: {} ({:.1%})".format(key, val, val / total))
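For context, the loop in the question walks over single characters rather than lines: `open(...).read()` returns one string, and iterating a string yields its characters, so capital letters in the header lines are counted too. Iterating the file object yields whole lines instead, which makes the `>` check possible. Below is a minimal sketch of the difference, assuming Python 3. The sample FASTA text and the names `sample`, `chars_from_string`, and `residues` are made up for illustration and are not from the original file.

```python
import io

# Hypothetical FASTA text, for illustration only.
sample = ">sp|P00001|TEST_ECOLI Made-up protein\nMKV\n>sp|P00002|DEMO_ECOLI Another\nLLA\n"

# Iterating a string yields single characters, so header text is visited too.
chars_from_string = list(sample)[:4]

# Iterating a file-like object yields whole lines, so headers can be skipped.
residues = "".join(
    line.strip() for line in io.StringIO(sample) if not line.startswith(">")
)

print(chars_from_string)
print(residues)
```

Here `io.StringIO` only stands in for an open file so the snippet runs without a file on disk.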
    

    You could also use a collections.Counter dict, as long as the sequence lines only contain what you are interested in (a Counter counts every character it is given):

    from collections import Counter

    counts = Counter()
    with open("input.fasta") as ecoli:  # will close your file automatically
        for line in ecoli:  # iterate over the file object, no need to read all contents into memory
            if line.startswith(">"):  # skip header lines that start with >
                continue
            counts.update(line.rstrip())  # count every character on the line, minus trailing whitespace
    total = float(sum(counts.values()))  # float so the division also works on Python 2
    for key, val in counts.most_common():  # most frequent residue first
        print("{}: {} ({:.1%})".format(key, val, val / total))
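If the sequence lines might contain letters outside the 20 standard codes, the two ideas above can be combined: use a Counter, but only feed it the characters you care about. The following is a rough sketch under that assumption, written for Python 3. The helper names `count_residues` and `format_counts` and the `demo` lines are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# The 20 standard one-letter amino acid codes, as listed in the question.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def count_residues(lines):
    """Count standard residues in an iterable of FASTA lines, skipping headers."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):  # header line, not sequence data
            continue
        counts.update(ch for ch in line.strip() if ch in STANDARD)
    return counts


def format_counts(counts):
    """Return 'X: count (pct%)' strings, most frequent residue first."""
    total = sum(counts.values())
    return [
        "{}: {} ({:.1%})".format(aa, n, n / total)
        for aa, n in counts.most_common()
    ]


# Made-up input: one header and two sequence lines, with a non-standard X.
demo = [">header with CAPITALS\n", "LLLA\n", "LAXG\n"]
report = format_counts(count_residues(demo))
print("\n".join(report))
```

Because `count_residues` accepts any iterable of lines, an open file object can be passed to it directly, e.g. `count_residues(open_file)` inside a `with open(...)` block.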