I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.
A simplified example:
"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"
Should result in a consensus sequence of "AGGGGC"
.
I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.
Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?
Thanks!
Add gap characters at the end of your sequences so that they all have the same length. Then you can process them with your program of choice, e.g. MUSCLE:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Applications import MuscleCommandline
sequences = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
print cline