Search code examples
pythonlistsequencessubsequence

Python: Identifying and deleting duplicate sequences in list


I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two neighbouring values.

Example: In the following list, the repeating sequence should be identified and deleted.

a = [45874,
    35195, # <-
    28965,
    05867,
    25847, # <-
    94937,
    64894,
    55535,
    62899,
    00391,
    35195, # Duplicate Sequence Start
    28965,
    05867,
    25847, # Duplicate Sequence End
    08483,
    55801,
    33129,
    42616]

I can't wrap my head around any solution, so any help is much appreciated!


Solution

  • Finding a single subsequence of length m in a string of length n can be done in no less than O(nm) with Boyer-Moore algorithm. Finding all duplicate subsequences will most likely be more than that. Although, if all you want is to delete the subsequences and not find them, there is a trick.

    Deleting duplicate subsequences in O(n)

    We only need to concern ourselves with subsequences of length 2, because any sequence can be expressed as overlapping subsequences of length 2. This observation allows us to keep a count of such pairs and then remove them from right to left.

    In particular, this requires traversing the list only a fixed amount of times and is thus O(n).

    from collections import Counter
    
    def remove_duplicates(lst):
        dropped_indices = set()
        counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))
    
        for i in range(len(lst) - 2, -1, -1):
            sub = tuple(lst[i:i+2])
            if counter[sub] > 1:
                dropped_indices |= {i, i + 1}
                counter[sub] -= 1
    
        return [x for i, x in enumerate(lst) if i not in dropped_indices]
    

    Here is an example based on the provided input.

    a = [45874,
         35195, # This
         28965, # Is
         5867,  # A
         25847, # Subsequence
         94937,
         64894,
         55535,
         62899,
         391,
         35195, # Those
         28965, # Should
         5867,  # Be
         25847, # Removed
         8483]
    
    b = remove_duplicates(a)
    # Output:
    #   [45874,
    #    35195,
    #    28965,
    #    5867,
    #    25847,
    #    94937,
    #    64894,
    #    55535,
    #    62899,
    #    391,
    #            <- Here the duplicate was removed
    #    8483]