Search code examples
pythonbioinformaticstext-processingbiopython

Get consensus from a MSA fasta file with IUPAC ambiguities in Python


I have an almost similar question to the topic : https://www.biostars.org/p/154993/

I have a fasta file with align sequence and I want to generate a consensus by using IUPAC code.

So far I wrote :

import sys
from Bio import AlignIO
from Bio.Align import AlignInfo

alignment = AlignIO.read("input.fasta", 'fasta')
summary_align = AlignInfo.SummaryInfo(alignment)
summary_align.dumb_consensus(0.3)

The "0.3" is the threshold of allele frequency to be considered in the consensus. It means that for each column, if a base is represented at least in 30% of the aligment, it will be taken into account; and if more than one base fit this criteria, the corresponding IUPAC ambiguity code is used.

But the "dumb_consensus" only generate the highest represented base and de facto don't use the IUPAC code.

So do you have a way to do such a consensus using Biopython ( or Python in general ) ?

Thanks


Solution

  • The raw solving. Although, Biopython code on GitHub looks not better. You can extend this for your aims.

    import sys
    from typing import Optional
    from Bio import AlignIO
    from Bio.Align import AlignInfo
    from Bio.Seq import Seq
    import Bio
    
    def getWitConsensus(alignment: Optional[Bio.Align.MultipleSeqAlignment], threshold=0.25) -> Bio.Seq.Seq:
        alphabet = {"A" : "A",
                    "C" : "C",
                    "G" : "G",
                    "T" : "T",
                    "AG" : "R",
                    "CT" : "Y",
                    "CG" : "S",
                    "AT" : "W",
                    "GT" : "K",
                    "AC" : "M",
                    "CGT" : "B",
                    "AGT" : "D",
                    "ACT" : "H",
                    "ACG" : "V",
                    "ACGT" : "N"}
        
        threshold *= len(alignment)
        consensus = list()
        
        for i in range(alignment.get_alignment_length()):
            frequences = {"A" : 0, "C" : 0, "T" : 0, "G" : 0}
            for record in alignment:
                frequences[record[i]] += 1
            print(frequences)
            consensus.append(alphabet["".join(sorted(key for key, value in frequences.items() if value >= threshold))])
        return Seq("".join(consensus))
    
    alignment = AlignIO.read(r"input.fas", 'fasta')
    getWitConsensus(alignment)