Merge sequences of unique elements

I'm trying to merge a number of sequences, as in the following example:

x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

merged = ['one', 'two', 'three', 'four', 'five']

The given sequences are all subsequences of the same, duplicate-free sequence (which is not given). If the order cannot be determined – as with 'four' and 'five' in the example, which could also be inverted – either solution is ok.

The problem resembles multiple sequence alignment, but I suspect there is an (algorithmically) easier solution, since it is more restricted (no duplicates, no crossing edges). E.g., when starting from the union of all elements, I would only need to order the elements – but I can't seem to find a decent way to deduce the underlying order from the input sequences.

The example is in Python and ideally a solution would be too, but the problem is of a general algorithmic nature.

Solution

Here is a very inefficient method that should do what you want:

w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

def get_score(m, k):
    # Score of k: 1 plus the scores of every term seen after k.
    return sum(get_score(m, kk) for kk in m[k]) + 1

# m maps each term to the list of terms that follow it in any input list.
m = {}
for lst in [w,x,y,z]:
    for (i,src) in enumerate(lst):
        if src not in m: m[src] = []
        m[src].extend(lst[i+1:])

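scored_u = [(k,get_score(m,k)) for k in m]
# Python 3 removed tuple parameters in lambdas, so index into the pair instead.
scored_s = sorted(scored_u, key=lambda ks: ks[1], reverse=True)

for (k,s) in scored_s:
    print((k,s))  # print the pair as a tuple so it works the same in Python 2 and 3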

Output:

('zero', 13)
('one', 12)
('two', 6)
('three', 3)
('four', 1)
('five', 1)

The approach first builds a mapping m whose keys are the terms of the lists. Each value is the list of terms that have been seen to follow that key in any input list; a term appears more than once if it follows the key in more than one list.

So in this case, m looks like:

{
  'three': ['five', 'four'], 
  'two':   ['four', 'three', 'five'], 
  'four':  [], 
  'zero':  ['one'], 
  'five':  [], 
  'one':   ['two', 'four', 'three', 'four']
}

From there, it computes a score for each key. The score is defined as the sum of the scores of the elements that have been seen to follow it, plus 1.

get_score(m, 'four') = 1
get_score(m, 'five') = 1
# and thus
get_score(m, 'three') = 3  # (1(four) + 1(five) + 1)

It does this for each element found in the input lists (in my case w, x, y, z), then sorts the elements by score, descending.
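To get back the merged list the question asks for, the same steps can be wrapped in a function. This is only a sketch of the approach above for Python 3; the name merge_sequences is made up for illustration, and it assumes the inputs really are subsequences of one duplicate-free sequence (a cycle in the ordering would recurse forever).

```python
def merge_sequences(*sequences):
    # Hypothetical helper name, not part of the original answer.
    # Map each element to every element seen after it in any sequence.
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])

    def get_score(key):
        # Same recursive definition as above: 1 plus the scores of all followers.
        return sum(get_score(nxt) for nxt in followers[key]) + 1

    # Highest score first; sorted() is stable, so ties keep first-seen order.
    return sorted(followers, key=get_score, reverse=True)


w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

merged = merge_sequences(w, x, y, z)
print(merged)  # ['zero', 'one', 'two', 'three', 'four', 'five']
```

'four' and 'five' tie with a score of 1, so their relative order here simply follows the order in which they were first seen.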

I say this is inefficient because, in the current implementation, get_score determines the score for some keys multiple times. It could be memoized, so that you only had to determine the score of a key once. You'd likely do this via backtracking: compute the scores of keys where the value is an empty list, and work backwards.
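One way to get the "score each key once" behaviour is to cache the recursive calls. The sketch below uses functools.lru_cache, which is top-down caching rather than the work-backwards pass described above; score_all is a hypothetical name, and m is copied from the mapping shown earlier.

```python
from functools import lru_cache


def score_all(m):
    # Hypothetical helper: returns {key: score} for the mapping m.
    @lru_cache(maxsize=None)
    def score(key):
        # The body runs once per key; repeated lookups hit the cache.
        return sum(score(nxt) for nxt in m[key]) + 1

    return {key: score(key) for key in m}


m = {
    'three': ['five', 'four'],
    'two':   ['four', 'three', 'five'],
    'four':  [],
    'zero':  ['one'],
    'five':  [],
    'one':   ['two', 'four', 'three', 'four'],
}

scores = score_all(m)
for key in sorted(scores, key=scores.get, reverse=True):
    print(key, scores[key])
```

The scores are the same as before; only the amount of repeated work changes. The recursion is still as deep as the longest chain of followers, so very long inputs could hit Python's recursion limit.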

Note: All this guarantees is that an element scores strictly higher than every element seen to follow it, so it is never sorted later than where it "is expected". For example, adding

v = ['one-point-five', 'four']

into the mix gives one-point-five a score of 2, which places it above four on the list but below three. Since you're only referencing it once, in v, there's not enough context to do a better job.
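The effect of adding v can be checked directly. This is a self-contained sketch for Python 3 that repeats the scoring steps from above; the function name rank is made up for illustration, and v is appended after the other lists.

```python
def rank(sequences):
    # Hypothetical helper: returns (element, score) pairs, highest score first.
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])

    def get_score(key):
        return sum(get_score(nxt) for nxt in followers[key]) + 1

    scored = [(key, get_score(key)) for key in followers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


v = ['one-point-five', 'four']
w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

ranked = rank([w, x, y, z, v])
for pair in ranked:
    print(pair)
```

one-point-five only has 'four' as a follower, so it scores 1 + 1 = 2 and lands between three (3) and four (1), even though nothing in the input says how it relates to one, two, or three.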