Search code examples
pythonlistiterator

Interleave different length lists, elimating duplicates, and preserve order


I have two lists, let's say:

keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I']
keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K']

How do I create a merged list without duplicates that preserve the order of both lists, inserting the missing elements where they belong? Like so:

merged = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']

Note that the elements can be compared against equality but not ordered (they are complex strings). The elements can't be ordered by comparing them, but they have an order based on their occurrence in the original lists.

In case of contradiction (different order in both input lists), any output containing all elements is valid. Of course with bonus points if the solution shows 'common sense' in preserving most of the order.

Again (as some comments still argue about it), the lists normally don't contradict each other in terms of the order of the common elements. In case they do, the algorithm needs to handle that error gracefully.

I started with a version that iterates over the lists with .next() to advance just the list containing the unmatched elements, but .next() just doesn't know when to stop.

merged = []
L = iter(keys1)
H = iter(keys2)
l = L.next()
h = H.next()

for i in range(max(len(keys1, keys2))):
  if l == h:
    if l not in merged:
      merged.append(l)
    l = L.next()
    h = H.next()

  elif l not in keys2:
    if l not in merged:
      merged.append(l)
    l = L.next()

  elif h not in keys1:
    if h not in merged:
      merged.append(h)
    h = H.next()

  else: # just in case the input is badly ordered
    if l not in merged:
      merged.append(l)
    l = L.next()
    if h not in merged:
      merged.append(h)
    h = H.next()   

print merged

This obviously doesn't work, as .next() will cause an exception for the shortest list. Now I could update my code to catch that exception every time I call .next(). But the code already is quite un-pythonic and this would clearly burst the bubble.

Does anyone have a better idea of how to iterate over those lists to combine the elements?

Bonus points if I can do it for three lists in one go.


Solution

  • What you need is basically what any merge utility does: It tries to merge two sequences, while keeping the relative order of each sequence. You can use Python's difflib module to diff the two sequences, and merge them:

    from difflib import SequenceMatcher
    
    def merge_sequences(seq1,seq2):
        sm=SequenceMatcher(a=seq1,b=seq2)
        res = []
        for (op, start1, end1, start2, end2) in sm.get_opcodes():
            if op == 'equal' or op=='delete':
                #This range appears in both sequences, or only in the first one.
                res += seq1[start1:end1]
            elif op == 'insert':
                #This range appears in only the second sequence.
                res += seq2[start2:end2]
            elif op == 'replace':
                #There are different ranges in each sequence - add both.
                res += seq1[start1:end1]
                res += seq2[start2:end2]
        return res
    

    Example:

    >>> keys1 = ['A', 'B', 'C', 'D', 'E',           'H', 'I']
    >>> keys2 = ['A', 'B',           'E', 'F', 'G', 'H',      'J', 'K']
    >>> merge_sequences(keys1, keys2)
    ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
    

    Note that the answer you expect is not necessarily the only possible one. For example, if we change the order of sequences here, we get another answer which is just as valid:

    >>> merge_sequences(keys2, keys1)
    ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'I']