
Return lengths of repeating substring


For a given substring, I need to determine all the lengths, in order, of the repeating chains of that substring in a given string.

Example: for the substring ATT and a string ATTATTATT GGG ATTATT GGG ATT, I want to return (3,2,1).

I think I have a solution, but it's inelegant and potentially slow (written below). I wanted to use more_itertools.consecutive_groups() on the start locations of the substring, but couldn't figure out how to adjust for the substring being longer than length 1.

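from re import finditer

# The original snippet had a bare `return`, so a function is implied;
# the name chain_lengths is only illustrative.
def chain_lengths(substring, string):
    spans = [i.span() for i in finditer(substring, string)]
    lengths = []
    runninglength = 1
    for i in range(len(spans)):
        if i == len(spans)-1:
            # last match: close the final chain
            lengths.append(runninglength)
        elif spans[i][1] == spans[i+1][0]:
            # next match starts where this one ends: same chain
            runninglength += 1
        else:
            # gap before the next match: close this chain, start a new one
            lengths.append(runninglength)
            runninglength = 1
    # return only after every match has been visited
    return tuple(lengths)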

Is there a faster, less confusing way to accomplish this?


Solution

  • You could use re.findall with a pattern that matches one or more back-to-back copies of the search string, so that each match is one whole chain. Dividing the length of each match by the length of the search string then gives the number of repeats in that chain. For example:

    import re
    
    s = 'ATTATTATT GGG ATTATT GGG ATT'
    sub = 'ATT'
    sl = len(sub)
    
    # re.escape keeps the pattern literal if sub ever contains regex metacharacters
    regex = re.compile(f'((?:{re.escape(sub)})+)')
    
    lens = [len(m) // sl for m in regex.findall(s)]
    print(lens)
    

    Output:

    [3, 2, 1]
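
The question mentions wanting to group the start locations with more_itertools.consecutive_groups(). A rough sketch of the same idea using only the standard library's itertools.groupby might look like the following. The function name run_lengths is made up for illustration, and the sketch assumes a non-empty search string and the same non-overlapping, left-to-right matching that re.finditer performs.

```python
import re
from itertools import groupby


def run_lengths(sub, s):
    """Return the lengths of consecutive runs of `sub` in `s`, in order.

    Illustrative helper (not from the original post); assumes `sub` is non-empty.
    """
    # Start offsets of the non-overlapping matches, left to right.
    starts = [m.start() for m in re.finditer(re.escape(sub), s)]
    step = len(sub)
    # Inside one run, each start is exactly `step` past the previous one,
    # so start - index * step stays constant and can serve as the group key.
    groups = groupby(enumerate(starts), key=lambda p: p[1] - p[0] * step)
    # Count the members of each group without building intermediate lists.
    return tuple(sum(1 for _ in g) for _, g in groups)


print(run_lengths('ATT', 'ATTATTATT GGG ATTATT GGG ATT'))  # (3, 2, 1)
```

This is the usual "value minus index" trick for consecutive integers, scaled by the substring length. That scaling is the adjustment needed when the substring is longer than one character. The regex-only version above is likely simpler for most uses. This variant may be handy if the start positions are needed for something else as well.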