Search code examples
algorithmsequencelcs

Longest Common Sequence with Sequences Slightly Out of order Locally


I need to solve the longest common sequence problem between two sequences when they are slightly out of order and the nodes that are out of order are close to each other. In other words, let's say one of the sequences are s1 s2 s3 ... s_n. It may come in with s1 and s2 swapped, or s1 and s3 swapped, or s_{n-1} and s_n swapped, but it is unlikely that s1 and s_n will swap order.

This arises when an observed sequence is a mixture of many sequences generated concurrently. For example, if I have two concurrent sources that generate (a1, a2, ..., a100) and (b1, b2, ..., b100). Many of the nodes are generated at almost exactly the same time, and so the combined sequence that we use for matching can end up becoming (a1, a2, b1, b2, ... ),or (a1, b1, a2, b2); because the nodes a1, a2, a3, b1, b2, b3 might be generated in the same nanosecond and any order is possible. But it is not possible that a1 and b100 swap orders because they are generated seconds apart.

Let's say there are two such sequences, where they are globally in order, but locally there can be small rearrangements. How do I apply the longest common sequence algorithm in this case?


Solution

  • I'm sorry for not answering.

    The idea is to read from each sequence into a staging pool, and then when things are about to expire from the pool, try to pair them up with the other sequence. The bigger you make the pool, the bigger the rearrangement you allow.

    class CurrentPool:
        def __init__ (self, items):
            self.items = items
            self.index = -1
            self.pool = {}
            self.empty = False
    
        def read_into_pool(self):
            self.index += 1
            if self.index < len(self.items):
                self.pool[self.index] = self.items[self.index]
            elif 0 == len(self.pool):
                self.empty = True
            return self.pool
    
    def compare_lists (items1, items2, max_offset=1):
        pool1 = CurrentPool(items1)
        pool2 = CurrentPool(items2)
        trailing_index = -max_offset
        lcs = []
        while not pool1.empty and not pool2.empty:
            temp_pool1 = pool1.read_into_pool()
            temp_pool2 = pool2.read_into_pool()
            # We are forced to take keys if they are at the trailing_index.
            if trailing_index in temp_pool1:
                # Look for a match.
                min_index = None
                for i, val in temp_pool2.items():
                    if val == temp_pool1[trailing_index]:
                        if min_index is None or i < min_index:
                            min_index = i
                del temp_pool1[trailing_index]
                if min_index is not None:
                    lcs.append(temp_pool2[min_index])
                    del temp_pool2[min_index]
    
            if trailing_index in temp_pool2:
                # Look for a match the other way
                min_index = None
                for i, val in temp_pool1.items():
                    if val == temp_pool2[trailing_index]:
                        if min_index is None or i < min_index:
                            min_index = i
                del temp_pool2[trailing_index]
                if min_index is not None:
                    lcs.append(temp_pool1[min_index])
                    del temp_pool1[min_index]
            else:
                # We only drain if we find a good match.
                pass
    
            trailing_index += 1
        return lcs
    
    
    print(compare_lists([1, 2, 3, 4, 5, 6], [2, 1, 4, 5, 7], 2))