Search code examples
pythonnumpybioinformaticsdna-sequence

Find the most freq sequence amongst other sequences


I was given 10 DNA sequences each one consisting of 18 bases and I was asked to write a program that calculates the most frequent sequence(consensus) amongst all these sequences. For example let's say I have "ACTGGA","ACCAAT","CTTAGG", "ATGAAG"

The consensus sequence would be "A(CT)TA(GA)G" because in the first base position of each sequence "A" exists 3/4 times.For the second base position of each sequence i have 2 sequences with "C" and 2 sequences with "T" . So far my consensus sequence looks like A(CT), because i have established that "A" occurs 75% of the time in position 1 while for position 2 there's a 50-50% chance "T" or "C" will appear etc.So, i will include both "C" and "T" in the second position of my Consensus. Now I am trying to do this in python using pycharm even though I am not very familiar with it. I just wanted to push myself into learning something new. My problem is that when I am reading the txt file into a numpy array like this:

seq_array = np.asarray(list(map(str.strip, seq)))

My array ends up being 1d like (10,) instead of (10,18) so I can't iterate through every row and col in order to count each char. So how do I read the file into a 2d size (10,18)? and instead of iterating every seq_array[r][c], is there a faster way I can get to the most frequent sequence?


Solution

  • The issue you are having with generating your multidimensional array:

    list(map(str.strip, seq))  # Returns a list of strings
    
    list(map(list, map(str.strip, seq)))  # Will return a list of lists
    
    seq_array = np.asarray(list(map(list, map(str.strip, seq))))