I'm trying to merge a number of sequences, as in the following example:
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']
merged = ['one', 'two', 'three', 'four', 'five']
The given sequences are all subsequences of the same, duplicate-free sequence (which is not given). If the order cannot be determined (as with 'four' and 'five' in the example, which could also be inverted), either solution is OK.
The problem resembles multiple sequence alignment, but I suspect there is an (algorithmically) easier solution, since it is more restricted (no duplicates, no crossing edges). For example, when starting from the union of all elements, I would only need to order the elements, but I can't seem to find a decent way to deduce the underlying order from the input sequences.
The example is in Python and a desired solution would also be, but the problem is of general algorithmic nature.
Here is a very inefficient method that should do what you want:
w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']
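
def get_score(m, k):
    # Score of k = 1 + the sum of the scores of every element seen after k.
    v = m[k]
    return sum(get_score(m, kk) for kk in v) + 1

# Map each element to the list of elements seen after it in any input list.
m = {}
for lst in [w, x, y, z]:
    for (i, src) in enumerate(lst):
        if src not in m:
            m[src] = []
        for dst in lst[i+1:]:
            m[src].append(dst)

scored_u = [(k, get_score(m, k)) for k in m]
# Sort by score, highest first (a tuple-unpacking lambda is a syntax error in Python 3).
scored_s = sorted(scored_u, key=lambda ks: ks[1], reverse=True)
for (k, s) in scored_s:
    print((k, s))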
Output:
('zero', 13)
('one', 12)
('two', 6)
('three', 3)
('four', 1)
('five', 1)
The approach first builds a mapping m where the keys are the terms of the lists and the values are lists of the terms that have been seen to follow the key. So in this case, m looks like:
{
    'three': ['five', 'four'],
    'two': ['four', 'three', 'five'],
    'four': [],
    'zero': ['one'],
    'five': [],
    'one': ['two', 'four', 'three', 'four']
}
From there, it computes a score for each key. The score is defined as the sum of the scores of the elements that have been seen to follow it, plus 1.
So
get_score(m, 'four') = 1
get_score(m, 'five') = 1
# and thus
get_score(m, 'three') = 3 # (1(four) + 1(five) + 1)
It does this for each element found in the input lists (in my case w, x, y, z), then sorts the elements by score, descending.
I say this is inefficient because get_score could be memoized, so that the score of each key is determined only once. You'd likely do this by working backwards: first compute the scores of the keys whose value is an empty list, then the scores of the keys that depend on them. The current implementation determines the score of some keys multiple times.
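A minimal sketch of the memoized variant, assuming Python 3 and using functools.lru_cache as the cache. The helper names build_followers and score_all are made up for this illustration and are not part of the code above:

```python
from functools import lru_cache

def build_followers(sequences):
    """Map each element to the list of elements seen after it in any sequence."""
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])
    return followers

def score_all(followers):
    """Return {element: score}, computing each element's score only once."""
    @lru_cache(maxsize=None)
    def score(key):
        # Same definition as get_score: 1 + the sum of the followers' scores.
        return 1 + sum(score(nxt) for nxt in followers[key])
    return {key: score(key) for key in followers}

w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

scores = score_all(build_followers([w, x, y, z]))
# Highest score first; ties keep their first-seen order because sorted() is stable.
merged = sorted(scores, key=scores.get, reverse=True)
print(merged)
```

The scores are the same as before; only the repeated work goes away. The recursion is still as deep as the longest chain of elements, so very long inputs might hit Python's recursion limit.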
Note: All this guarantees is that an element won't end up lower than where it "is expected", i.e. never below an element that was seen to follow it. For example, adding
v = ['one-point-five', 'four']
into the mix will place one-point-five above four in the list, but since it is referenced only once, in v, there's not enough context to do a better job.
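To see that limitation concretely, here is a small self-contained sketch (Python 3; the helper name get_scores is hypothetical, and it uses a plain dict as the memo) that reruns the scoring with v included:

```python
def get_scores(sequences):
    """Score every element: 1 + the sum of the scores of the elements seen after it."""
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])

    memo = {}
    def score(key):
        # Compute each key's score once and reuse it afterwards.
        if key not in memo:
            memo[key] = 1 + sum(score(nxt) for nxt in followers[key])
        return memo[key]

    return {key: score(key) for key in followers}

w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']
v = ['one-point-five', 'four']

scores_with_v = get_scores([w, x, y, z, v])
order_with_v = sorted(scores_with_v, key=scores_with_v.get, reverse=True)
print(order_with_v)
```

Here one-point-five gets a score of 2 (one for itself plus the score of four), so it lands directly above four and below three. Nothing in the input relates it to one, two or three, so the scoring cannot place it any more precisely.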