Search code examples
pythonobjective-cperllcssequence-alignment

Multiple Sequence Alignment (Longest Common Subsequence)?


OK this is what I want to do:

Get more than two strings and "align" them (no DNA/RNA sequence or the like, just regular strings with not like 1000 items in each of them)

I've already done some work with pairwise alignment (align two strings) however the "gaps" create some issues for me when trying to align more than one pair.

Example (one I'm currently testing):

ABCDEF
ABGHCEEF
AJKLBCDYEOF

AB--CDEF
ABGHCEEF
=======================
AB--C-EF

A-B--C--E-F
AJKLBCDYEOF
=======================
A----C--E-F

And another (more illustrative) example :

http://nest.drkameleon.com
http://www.google.com
http://www.yahoo.com

http://nest.drkameleon.com
http://-www.--google--.com

=======================
http://----.------le--.com

http://----.------le--.com
http://-www.-----yahoo.com

=======================
http://----.----------.com

What I'm currently doing :

  • Sort the strings (longer strings come first in the list)
  • Align the first pair : A-B and get the result (let's say R1)
  • Then align the second pair : R1 and C (result in R2)
  • Then align the third pair : R2 and D
  • And so on...

So what's in your mind? How could I go for that? Is there a better way? (Of course, there must be...)

I'd rather do that in Perl/Python or something along these lines, however any type of code/reference would be more than welcome! :-)


Solution

  • I think you may be able to cast this problem as a more general string diff problem instead of a string alignment. Consider how GNU diff is used for finding differences between two files, and use the same algorithms as are used to perform an N-way diff.

    I'm not sure if the time/memory complexity of this approach is amenable to your needs, but you can at least think about the problem this way.