Python: build consensus sequence

I want to build a consensus sequence from several sequences in python and I'm looking for the most efficient / most pythonic way to achieve this.

I have a list of strings like this:

sequences = ["ACTAG", "-TTCG", "CTTAG"]

I furthermore have an alphabet like this:

alphabet = ["A", "C", "G", "T"]

and a position frequency matrix like this:

If a character occurrs the most at a position, this character is taken for the consensus sequence.

Additionally, when 2 or more characters have the same occurrences for the same position there are additional characters (in this example at position 0 => A or C = M, see IUPAC Codes)

The expected consensus sequence for my example is therefore "MTTAG".

EDIT:

What is the most efficient / most pythonic way to get this consensus sequence based on the given alphabet and position frequency matrix?

Solution

If you already have the position frequency matrix, you could process it as a pandas DataFrame. I chose to orient it such that the alphabet is the index (note the transpose call at the end):

freq = pd.DataFrame([[1, 1, 0, 0], [0, 1, 0, 2], [0, 0, 0, 3], [2, 1, 0, 0], [0, 0, 3, 0]], columns=['A', 'C', 'G', 'T']).transpose()

gives

   0  1  2  3  4
A  1  0  0  2  0
C  1  1  0  1  0
G  0  0  0  0  3
T  0  2  3  0  0

You'll want to look only at the most common nucleotides:

most_common = freq[freq == freq.max(axis=0)]

gives

     0    1    2    3    4
A  1.0  NaN  NaN  2.0  NaN
C  1.0  NaN  NaN  NaN  NaN
G  NaN  NaN  NaN  NaN  3.0
T  NaN  2.0  3.0  NaN  NaN

Then create a function that determines the consensus from a single column of the above matrix, based on IUPAC codes:

codes = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 
    'AG': 'R', 'CT': 'Y', 'CG': 'S', 'AT': 'W', 'GT': 'K', 'AC': 'M', 
    'CGT': 'B', 'AGT': 'D', 'ACT': 'H', 'ACG': 'V', 
    'ACGT': 'N'
}
def freq_to_code(pos):
    top_nucs = pos.dropna().index
    key = ''.join(sorted(top_nucs))
    return codes[key]

Apply that function to each column and form a string to get the final result:

consensus = most_common.apply(freq_to_code, axis=0)
print(''.join(consensus))

gives MTTAG