python python-3.x string substring python-itertools

Return lengths of repeating substring

For a given substring, I need to determine all the lengths, in order, of the repeating chains of that substring in a given string.

Example: for the substring ATT and a string ATTATTATT GGG ATTATT GGG ATT, I want to return (3,2,1).

I think I have a solution, but it's inelegant and potentially slow (written below). I wanted to use more_itertools.consecutive_groups() on the start locations of the substring, but couldn't figure out how to adjust for the substring being longer than length 1.

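from re import finditer

def chain_lengths(substring, string):
    # (start, end) span of every non-overlapping match, left to right
    spans = [i.span() for i in 
             finditer(substring,string)]
    lengths = []
    runninglength = 1
    for i in range(len(spans)):
        if i == len(spans)-1:
            # last match: close the final chain
            lengths.append(runninglength)

        elif spans[i][1] == spans[i+1][0]:
            # next match starts where this one ends: same chain
            runninglength += 1

        else:
            # gap before the next match: close this chain
            lengths.append(runninglength)
            runninglength = 1
    return tuple(lengths)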

Is there a faster, less confusing way to accomplish this?

Solution

You could use re.findall with a pattern that matches one or more back-to-back copies of the substring, so each match is one whole chain. Dividing the length of each match by the length of the substring then gives the number of consecutive repeats in that chain. For example:

import re

s = 'ATTATTATT GGG ATTATT GGG ATT'
sub = 'ATT'
sl = len(sub)

# re.escape keeps any regex metacharacters in sub (e.g. '.', '+') literal
regex = re.compile(f'((?:{re.escape(sub)})+)')

lens = [len(m) // sl for m in regex.findall(s)]
print(lens)

Output:

[3, 2, 1]
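
If you would rather group the start locations, as the question suggests with consecutive_groups, here is a minimal sketch that uses only the standard library (itertools.groupby instead of more_itertools). The function name run_lengths is made up for illustration. The idea is that inside one chain the k-th match starts exactly len(sub) characters after the previous one, so start - k * len(sub) stays constant within a chain and changes as soon as there is a gap:

```python
import re
from itertools import groupby


def run_lengths(sub, s):
    """Return the lengths of consecutive chains of `sub` in `s`, in order."""
    sl = len(sub)
    # Start offsets of the non-overlapping matches, scanning left to right.
    starts = [m.start() for m in re.finditer(re.escape(sub), s)]
    # Key each match by start - index * sl: equal keys mean the matches
    # are back to back, a new key means a new chain has begun.
    groups = groupby(enumerate(starts), key=lambda pair: pair[1] - pair[0] * sl)
    # Count the matches in each group and return them as a tuple.
    return tuple(sum(1 for _ in group) for _, group in groups)


print(run_lengths('ATT', 'ATTATTATT GGG ATTATT GGG ATT'))
```

This returns a tuple, as asked for in the question, and an empty tuple when the substring does not occur at all. I have not timed it against the regex version above, so treat it as an alternative for readability rather than a speed-up.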