Search code examples
pythonalignment

How to extract polymorphic positions from several protein alignments?


I have several fasta protein alignments (~5000) and I would want to identify polymorphic positions plus the aminoacid residues that change between sequences. I have tried to write a code myself, but it has been very difficult (I'm new in programming), and I looked on BioPython but I hasn't found anything yet. I want something like:

Protein alignment:

> sp1 MQGAAYMQAAAYYMQA
> sp2 MQGAARMQGAAYYMQA
> sp3 MQGAARMQGAAYYMQM
> sp4 MQGAARMQGAAYYMQA
> sp5 MQGAARMQAAAYYMQA
           ^  ^      ^

In the example above, the alignment has 3 polymorphic positions (marked with ^). The first one is located on the 6th position, the second one has the 9th position and the third one the 16th position. A common notation of a polymorphic site coult be as follows: R6Y, which means that a change occured in the 6th position from an R to a Y. The direction of the change (R->Y or Y->R) is based on the most frequent letter at that position. So, in this case R has the highest frequency and one can infer that the direction is R->Y.

As you can see, the 6th and 16th positions have single changes (the different letter is at frequency of 1). However, the 9th position has two sequences (sp1 and sp5) with the change. I would like two distinguish between this two types of polymorphisms. Thus, in this case I woul like an output something like this:

Output :

# Alignment #1
#   Single polymorphisms:
#     R6Y: sp1
#     A16M: sp3

#   Non-single polymorphisms:
#     G9A: sp1, sp5

I hope this has helped to clarify (sorry if it is a bit too long).

Any suggestion is very appreciated, thanks!!


Solution

  • Here's a function that will find the differences between two alleles. The first one should be the canonical polymorph made up of the most frequent letter at every position (MQGAARMQGAAYYMQA in your example).

    def polymorphic_positions(allele1, allele2):
        return [p[1][0] + str(p[0] + 1) + p[1][1] 
                for p in enumerate(zip(allele1, allele2))
                if p[1][0] != p[1][1]]
    

    Examples:

    >>> polymorphic_positions('MQGAARMQGAAYYMQA', 'MQGAARMQGAAYYMQM')
    ['A16M']
    >>> polymorphic_positions('MQGAARMQGAAYYMQA', 'MQGAARMQAAAYYMQA')
    ['G9A']
    

    Here are some references: