Search code examples
pythonsequencedna-sequence

Python: build consensus sequence


I want to build a consensus sequence from several sequences in python and I'm looking for the most efficient / most pythonic way to achieve this.

I have a list of strings like this:

sequences = ["ACTAG", "-TTCG", "CTTAG"]

I furthermore have an alphabet like this:

alphabet = ["A", "C", "G", "T"]

and a position frequency matrix like this:

[A C G T]
 1 1 0 0
 0 1 0 2
 0 0 0 3
 2 1 0 0
 0 0 3 0

If a character occurrs the most at a position, this character is taken for the consensus sequence.

Additionally, when 2 or more characters have the same occurrences for the same position there are additional characters (in this example at position 0 => A or C = M, see IUPAC Codes)

The expected consensus sequence for my example is therefore "MTTAG".

EDIT:

What is the most efficient / most pythonic way to get this consensus sequence based on the given alphabet and position frequency matrix?


Solution

  • If you already have the position frequency matrix, you could process it as a pandas DataFrame. I chose to orient it such that the alphabet is the index (note the transpose call at the end):

    freq = pd.DataFrame([[1, 1, 0, 0], [0, 1, 0, 2], [0, 0, 0, 3], [2, 1, 0, 0], [0, 0, 3, 0]], columns=['A', 'C', 'G', 'T']).transpose()
    

    gives

       0  1  2  3  4
    A  1  0  0  2  0
    C  1  1  0  1  0
    G  0  0  0  0  3
    T  0  2  3  0  0
    

    You'll want to look only at the most common nucleotides:

    most_common = freq[freq == freq.max(axis=0)]
    

    gives

         0    1    2    3    4
    A  1.0  NaN  NaN  2.0  NaN
    C  1.0  NaN  NaN  NaN  NaN
    G  NaN  NaN  NaN  NaN  3.0
    T  NaN  2.0  3.0  NaN  NaN
    

    Then create a function that determines the consensus from a single column of the above matrix, based on IUPAC codes:

    codes = {
        'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 
        'AG': 'R', 'CT': 'Y', 'CG': 'S', 'AT': 'W', 'GT': 'K', 'AC': 'M', 
        'CGT': 'B', 'AGT': 'D', 'ACT': 'H', 'ACG': 'V', 
        'ACGT': 'N'
    }
    def freq_to_code(pos):
        top_nucs = pos.dropna().index
        key = ''.join(sorted(top_nucs))
        return codes[key]
    

    Apply that function to each column and form a string to get the final result:

    consensus = most_common.apply(freq_to_code, axis=0)
    print(''.join(consensus))
    

    gives MTTAG