Python: Identifying and deleting duplicate sequences in list

I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two neighbouring values.

Example: In the following list, the repeating sequence should be identified and deleted.

a = [45874,
    35195, # <-
    28965,
    05867,
    25847, # <-
    94937,
    64894,
    55535,
    62899,
    00391,
    35195, # Duplicate Sequence Start
    28965,
    05867,
    25847, # Duplicate Sequence End
    08483,
    55801,
    33129,
    42616]

I can't wrap my head around any solution, so any help is much appreciated!

Solution

Finding a single subsequence of length m in a string of length n can be done in no less than O(nm) with Boyer-Moore algorithm. Finding all duplicate subsequences will most likely be more than that. Although, if all you want is to delete the subsequences and not find them, there is a trick.

Deleting duplicate subsequences in O(n)

We only need to concern ourselves with subsequences of length 2, because any sequence can be expressed as overlapping subsequences of length 2. This observation allows us to keep a count of such pairs and then remove them from right to left.

In particular, this requires traversing the list only a fixed amount of times and is thus O(n).

from collections import Counter

def remove_duplicates(lst):
    dropped_indices = set()
    counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))

    for i in range(len(lst) - 2, -1, -1):
        sub = tuple(lst[i:i+2])
        if counter[sub] > 1:
            dropped_indices |= {i, i + 1}
            counter[sub] -= 1

    return [x for i, x in enumerate(lst) if i not in dropped_indices]

Here is an example based on the provided input.

a = [45874,
     35195, # This
     28965, # Is
     5867,  # A
     25847, # Subsequence
     94937,
     64894,
     55535,
     62899,
     391,
     35195, # Those
     28965, # Should
     5867,  # Be
     25847, # Removed
     8483]

b = remove_duplicates(a)
# Output:
#   [45874,
#    35195,
#    28965,
#    5867,
#    25847,
#    94937,
#    64894,
#    55535,
#    62899,
#    391,
#            <- Here the duplicate was removed
#    8483]