Search code examples
arraysalgorithmalignmentsequence

Most Efficient Algorithm to Align an Multiple Ordered Sequences


I have a strange feeling this is a very easy problem to solve but I'm not finding a good way of doing this without using brute force or dynamic programming. Here it goes:

Given N arrays of ordered and monotonic values, find the set of positions for each array i1, i2 ... in that minimises pair-wise difference of values at those indexes between all arrays. In other words, find the positions for all arrays whose values are closest to each other. Multiple solutions may exist and arrays may or may not be equally sized.

If A denotes the list of all arrays, the pair-wise difference is given by the sum of absolute differences between all values at the given indexes between all different arrays, as so:

enter image description here

An example, 3 arrays a, b and c:

a = [20 29 30 32 33]
b = [28 29 30 32 33]
c = [10 12 28 31 32 33]

The best alignment for this array would be a[3] b[3] c[4] or a[4] b[4] c[5], because (32,32,32) and (33,33,33) are all equal values and have, therefore minimum pairwise difference between each other. (Assuming array index starts at 0)

This is a common problem in bioinformatics thats usually solved with Dynamic Programming, but due to the fact this is an ordered sequence, I think there's somehow a way of exploiting this notion of order. I first thought about doing this pairwise, but this does not guarantee the global optimum because the best local answer might not be the best global answer.

This is meant to be language agnostic, but I don't really mind an answer for a specific language, as long as there is no loss of generality. I know Dynamic Programming is an option here, but I have a feeling there's an easier way to do this?


Solution

  • The tricky thing is parsing the arrays so that at some point you're guaranteed to be considering the set of indices that realize the pairwise min. Using a min heap on the values doesn't work. Counterexample with 4 arrays: [0,5], [1,2], [2], [2]. We start with a d(0,1,2,2) = 7, optimal is d(0,2,2,2) = 6, but the min heap moves us from 7 to d(5,1,2,2) = 12, then d(5,2,2,2) = 9.

    I believe (but haven't proved) that if we alway increment the index that improves pairwise distance the most (or degrades it the least), we're guaranteed to visit every local min and the global min.

    Assuming n total elements across k arrays:

    Simple approach: we repeatedly get the pairwise distance deltas (delta wrt. incrementing each index), increment the best one, and any time doing so switch us from improvement to degradation (i.e. a local minimum) we calculate the pairwise distance. All this is O(k^2) per increment for a total running time of O((n-k) * (k^2)).

    With O(k^2) storage, we could keep an array where (i,j) stores the pairwise distance delta achieve by increment the index of array i wrt. array j. We also store the column sums. Then on incrementing an index we can update the appropriate row & column & column sums in O(k). This gives us a running time of O((n-k)*k)