Tags: sequence, multiple-columns, composition, fasta

Column-wise amino acid composition from a CSV file


Sample file:

Column 10: A|Y|E|A
Column 11: W|I|Q|Q

How do I calculate the amino acid composition (as a percentage) of each column separately? For example, in column 10 the composition is A 50%, E 25%, and Y 25%.

Biopython provides modules to calculate the amino acid composition for an entire file in FASTA format, record by record:

from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

for record in SeqIO.parse('output_translation3.fasta', 'fasta'):
    analysis = ProteinAnalysis(str(record.seq))
    # Count once per record; the result maps residue letter -> count
    counts = analysis.count_amino_acids()
    print('\n Results for record: {}'.format(record.id))
    for residue in 'GALM':
        print(counts[residue])

Solution

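  • from collections import Counter

    with open("input.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on them
            # "Column 10: A|Y|E|A" -> label "Column 10", residues "A|Y|E|A"
            col, _, seq = line.partition(": ")
            aa = seq.split("|")
            aa_counts = Counter(aa)
            aa_length = len(aa)
            print(col)
            for k, v in aa_counts.items():
                # fraction of the column occupied by residue k
                print("  ", k, v / aa_length)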
    

    Gives:

    Column 10
       A 0.5
       Y 0.25
       E 0.25
    Column 11
       W 0.25
       I 0.25
       Q 0.5