python merge subsequence

Merge sequences of unique elements


I'm trying to merge a number of sequences, as in the following example:

x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

merged = ['one', 'two', 'three', 'four', 'five']

The given sequences are all subsequences of the same, duplicate-free sequence (which is not given). If the order cannot be determined – as with 'four' and 'five' in the example, which could also be inverted – either solution is ok.

The problem resembles multiple sequence alignment, but I suspect there is an (algorithmically) easier solution, since it is more restricted (no duplicates, no crossing edges). E.g., when starting from the union of all elements, I would only need to order the elements – but I can't seem to find a decent way to deduce the underlying order from the input sequences.

The example is in Python, and a solution in Python would be ideal, but the problem is of a general algorithmic nature.


Solution

  • Here is a very inefficient method that should do what you want:

    w = ['zero', 'one']
    x = ['one', 'two', 'four']
    y = ['two', 'three', 'five']
    z = ['one', 'three', 'four']
    
    def get_score(m, k):
        v = m[k]
        return sum(get_score(m, kk) for kk in v) + 1
    
    m = {}
    for lst in [w, x, y, z]:
        for i, src in enumerate(lst):
            # Record every term that follows src in this list.
            m.setdefault(src, []).extend(lst[i + 1:])
    
    scored_u = [(k, get_score(m, k)) for k in m]
    # Python 3 does not allow tuple parameters in a lambda, so index the pair.
    scored_s = sorted(scored_u, key=lambda pair: pair[1], reverse=True)
    
    for pair in scored_s:
        print(pair)
    

    Output:

    ('zero', 13)
    ('one', 12)
    ('two', 6)
    ('three', 3)
    ('four', 1)
    ('five', 1)
    

    The approach first builds a mapping m whose keys are the terms of the lists and whose values are lists of the terms that were seen to follow that key in any input list.

    So in this case, m looks like:

    {
      'three': ['five', 'four'], 
      'two':   ['four', 'three', 'five'], 
      'four':  [], 
      'zero':  ['one'], 
      'five':  [], 
      'one':   ['two', 'four', 'three', 'four']
    }
    

    From there, it computes a score for each key. The score is defined by the sum of the scores of the elements that have been seen to follow it, plus 1.

    So

    get_score(m, 'four') = 1
    get_score(m, 'five') = 1
    # and thus
    get_score(m, 'three') = 3  # (1(four) + 1(five) + 1)
    

    It computes this score for every element found in the input lists (in my case w, x, y, z), then sorts the elements by score, descending.

    I call this inefficient because get_score is not memoized: in the current implementation, the score of some keys is computed multiple times. With memoization, each key's score would be determined only once. One way to do that is to work bottom-up: compute the scores of the keys whose value is an empty list first, then work backwards.
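The memoization mentioned above could look roughly like the following. This is only a sketch: the names build_followers, scores_memoized and merge_by_score are made up for illustration, and functools.lru_cache is one of several ways to cache the recursive calls.

```python
from functools import lru_cache


def build_followers(sequences):
    """Map each term to the list of terms seen after it in any sequence."""
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])
    return followers


def scores_memoized(sequences):
    """Return {term: score}, computing each term's score only once."""
    followers = build_followers(sequences)

    @lru_cache(maxsize=None)
    def score(term):
        # Same definition as get_score: followers' scores summed, plus 1.
        return 1 + sum(score(nxt) for nxt in followers[term])

    return {term: score(term) for term in followers}


def merge_by_score(sequences):
    """Order all terms by descending score (ties keep first-seen order)."""
    scores = scores_memoized(sequences)
    return sorted(scores, key=scores.get, reverse=True)


w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']

scores = scores_memoized([w, x, y, z])
merged = merge_by_score([w, x, y, z])
print(merged)
```

The scores are the same as in the unmemoized version, because duplicate followers are still counted; only the repeated recursion is avoided. For very long chains the recursion depth could still become a problem, which an explicit bottom-up pass would sidestep.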

    Note: All this guarantees is that an element is never ranked below any element that was seen to follow it. For example, adding

    v = ['one-point-five', 'four']
    

    into the mix will place one-point-five above four on the list, but since it is only referenced once, in v, there is not enough context to do a better job: it scores 2, so it lands just below three (score 3).
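To see what that guarantee means in practice, here is a small self-contained sketch that adds v to the inputs and checks the result against every input list. The helper names merge_by_score and respects are assumptions made for this illustration, and the scoring is the same idea as above with a plain dict as cache.

```python
def merge_by_score(sequences):
    """Order all terms by descending score, as in the approach above."""
    followers = {}
    for seq in sequences:
        for i, src in enumerate(seq):
            followers.setdefault(src, []).extend(seq[i + 1:])

    cache = {}

    def score(term):
        # Score = sum of the scores of all recorded followers, plus 1.
        if term not in cache:
            cache[term] = 1 + sum(score(nxt) for nxt in followers[term])
        return cache[term]

    return sorted(followers, key=score, reverse=True)


def respects(merged, seq):
    """True if the terms of seq appear in the same relative order in merged."""
    pos = {term: i for i, term in enumerate(merged)}
    return all(pos[a] < pos[b] for a, b in zip(seq, seq[1:]))


w = ['zero', 'one']
x = ['one', 'two', 'four']
y = ['two', 'three', 'five']
z = ['one', 'three', 'four']
v = ['one-point-five', 'four']

inputs = [w, x, y, z, v]
merged = merge_by_score(inputs)
print(merged)
print(all(respects(merged, seq) for seq in inputs))
```

Every input list is still a subsequence of the merged result, so the output is a valid answer; it is just not the one the element's name suggests, because nothing in the inputs relates one-point-five to one, two or three.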