Search code examples
pythonbioinformaticsbiopythondna-sequencesequence-alignment

Multiple Sequence Alignment with Unequal String Length


I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.

A simplified example:

"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"

Should result in a consensus sequence of "AGGGGC".

I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.

Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?

Thanks!


Solution

  • Add gap characters at the end of your sequences so that they all have the same length. Then you can process them with your program of choice, e.g. MUSCLE:

    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align.Applications import MuscleCommandline
    
    sequences = ["AGGGGC",
                 "AGGGC",
                 "AGGGGGC",
                 "AGGAGC",
                 "AGGGGG"]
    
    longest_length = max(len(s) for s in sequences)
    padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
    records = (SeqRecord(Seq(s)) for s in padded_sequences)
    
    SeqIO.write(records, "msa_example.fasta", "fasta")
    
    from Bio.Align.Applications import MuscleCommandline
    cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
    print cline