Search code examples
pythonsequencesequencessubsequence

Python: Sub sequence search


I have a sequence 'abccabac' and a subsequence 'abc'. I need to get indices of all occurences of subsequence 'abc' in the sequence. What is the memory efficient way of doing it?

Example:

Input:

**sequence**: 'abccabac' **subsequence**: 'abc'  

Output:

 012   
 013  
 017  
 057  
 457

Solution

  • the absolute most strait forward way I can think of is using itertools.combinations

    import itertools
    
    sequence = "abccabac"
    
    subsequence = "abc"
    
    for combo in itertools.combinations(enumerate(sequence), len(subsequence)):
        if "".join(pair[1] for pair in combo) == subsequence:
            print([pair[0] for pair in combo])
    

    If your actual sequence contains lots of characters that are not even in the subsequence then filtering out the irrelevant characters before starting the combinations would definitely provide a substantial efficiency boost:

    char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
    for combo in itertools.combinations(char_generator, len(subsequence)):
        ...
    

    as well instead of joining each combination you can just use all which will only check characters until one is found to be not equal:

    char_generator = (pair for pair in enumerate(sequence) if pair[1] in subsequence)
    for combo in itertools.combinations(char_generator, len(subsequence)):
        if all(pair[1]==char for pair,char in zip(combo,subsequence)):
            print([pair[0] for pair in combo])
    

    Here is a different alternative I just put together, I imagine the algorithm could be optimized even more but this is about the best I could come up with. The sub_sequence_generator creates a list of indices for each letter of the sub sequence, then the sub_helper for each level of the recursion goes through all indices for the next letter of the subsequence starting after the indice of the last letter.

    import itertools
    import bisect
    
    def sub_helper(cur_i, combo, all_options):
        #cur_i is the index of the letter in the subsequence
        #cur_idx is the index of the sequence which contains that letter
        if cur_i>0:
            prev_i = combo[cur_i-1]
        else:
            prev_i = -1
        cur_options = all_options[cur_i]
        start_i = bisect.bisect(cur_options,prev_i)
        cur_i_gen = itertools.islice(cur_options,start_i,None)
        if cur_i+1 == len(all_options): #last one:
            for cur_idx in cur_i_gen:
                combo[cur_i] = cur_idx
                yield tuple(combo)
        else:
            for cur_idx in cur_i_gen:
                combo[cur_i] = cur_idx
                yield from sub_helper(cur_i+1, combo, all_options)
    
    
    def sub_sequence_generator(sequence,sub_seq):
        indices_map = {c:[] for c in set(sub_seq)} #create a list for each character
        for i,c in enumerate(sequence):
            try:
                indices_map[c].append(i,) #the helper is simpler if they are all stored as single element tuples
            except KeyError:
                pass
    
        # now we have indices for each character of the sub_sequence
        # we just need to put them in a sequence that corelates to the subsequence
        chr_indices = tuple(indices_map[c] for c in sub_seq)
    
        return sub_helper(0,[None]*len(chr_indices), chr_indices)
    
    sequence = "abccabac"
    subsequence = "abc"
    
    for idxs in sub_sequence_generator(sequence,subsequence):
        print(idxs)
    

    In terms of memory this will make a list for each character of the subsequence, containing the indices in the main sequence where that character is present. A tuple of these lists, and a list of the indices that is continually updated and a tuple for each time a combination is yielded, and iterators like islice so this is extremely memory efficient, however since I have not extensively tested it I cannot give any guarantee it is bug free.