I'm looking for the best way to find duplicate sequences in a list. A sequence is defined as at least two neighbouring values.
Example: In the following list, the repeating sequence should be identified and deleted.
a = [45874,
35195, # <-
28965,
05867,
25847, # <-
94937,
64894,
55535,
62899,
00391,
35195, # Duplicate Sequence Start
28965,
05867,
25847, # Duplicate Sequence End
08483,
55801,
33129,
42616]
I can't wrap my head around any solution, so any help is much appreciated!
Finding a single subsequence of length m in a string of length n can be done in no less than O(nm) with Boyer-Moore algorithm. Finding all duplicate subsequences will most likely be more than that. Although, if all you want is to delete the subsequences and not find them, there is a trick.
We only need to concern ourselves with subsequences of length 2, because any sequence can be expressed as overlapping subsequences of length 2. This observation allows us to keep a count of such pairs and then remove them from right to left.
In particular, this requires traversing the list only a fixed amount of times and is thus O(n).
from collections import Counter
def remove_duplicates(lst):
dropped_indices = set()
counter = Counter(tuple(lst[i:i+2]) for i in range(len(lst) - 1))
for i in range(len(lst) - 2, -1, -1):
sub = tuple(lst[i:i+2])
if counter[sub] > 1:
dropped_indices |= {i, i + 1}
counter[sub] -= 1
return [x for i, x in enumerate(lst) if i not in dropped_indices]
Here is an example based on the provided input.
a = [45874,
35195, # This
28965, # Is
5867, # A
25847, # Subsequence
94937,
64894,
55535,
62899,
391,
35195, # Those
28965, # Should
5867, # Be
25847, # Removed
8483]
b = remove_duplicates(a)
# Output:
# [45874,
# 35195,
# 28965,
# 5867,
# 25847,
# 94937,
# 64894,
# 55535,
# 62899,
# 391,
# <- Here the duplicate was removed
# 8483]