I have a sequence 'abccabac' and a subsequence 'abc'. I need to get indices of all occurences of subsequence 'abc' in the sequence. What is the memory efficient way of doing it?
**sequence**: 'abccabac' **subsequence**: 'abc'
Output:
012
013
017
057
457
the absolute most strait forward way I can think of is using itertools.combinations
import itertools
sequence = "abccabac"
subsequence = "abc"
for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
if "".join(pair[1] for pair in combo) == subsequence:
print([pair[0] for pair in combo])
If your actual sequence contains lots of characters that are not even in the subsequence then filtering out the irrelevant characters before starting the combinations
would definitely provide a substantial efficiency boost:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
...
as well instead of joining each combination you can just use all
which will only check characters until one is found to be not equal:
char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
for combo in itertools.combinations(char_generator, len(subsequence)):
if all(pair[1]==char for pair,char in zip(combo,subsequence)):
print([pair[0] for pair in combo])
Here is a different alternative I just put together, I imagine the algorithm could be optimized even more but this is about the best I could come up with. The sub_sequence_generator
creates a list of indices for each letter of the sub sequence, then the sub_helper
for each level of the recursion goes through all indices for the next letter of the subsequence starting after the indice of the last letter.
import itertools
import bisect
def sub_helper(cur_i, combo, all_options):
#cur_i is the index of the letter in the subsequence
#cur_idx is the index of the sequence which contains that letter
if cur_i>0:
prev_i = combo[cur_i-1]
else:
prev_i = -1
cur_options = all_options[cur_i]
start_i = bisect.bisect(cur_options,prev_i)
cur_i_gen = itertools.islice(cur_options,start_i,None)
if cur_i+1 == len(all_options): #last one:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield tuple(combo)
else:
for cur_idx in cur_i_gen:
combo[cur_i] = cur_idx
yield from sub_helper(cur_i+1, combo, all_options)
def sub_sequence_generator(sequence,sub_seq):
indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
for i,c in enumerate(sequence):
try:
indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
except KeyError:
pass
# now we have indices for each character of the sub_sequence
# we just need to put them in a sequence that corelates to the subsequence
chr_indices = tuple(indices_map[c] for c in sub_seq)
return sub_helper(0,[None]*len(chr_indices), chr_indices)
sequence = "abccabac"
subsequence = "abc"
for idxs in sub_sequence_generator(sequence,subsequence):
print(idxs)
In terms of memory this will make a list for each character of the subsequence, containing the indices in the main sequence where that character is present. A tuple of these lists, and a list of the indices that is continually updated and a tuple for each time a combination is yield
ed, and iterators like islice
so this is extremely memory efficient, however since I have not extensively tested it I cannot give any guarantee it is bug free.