For a given substring, I need to determine all the lengths, in order, of the repeating chains of that substring in a given string.
Example: for the substring ATT and the string ATTATTATT GGG ATTATT GGG ATT, I want to return (3,2,1).
I think I have a solution, but it's inelegant and potentially slow (written below). I wanted to use more_itertools.consecutive_groups() on the start locations of the substring, but couldn't figure out how to adjust for the substring being longer than one character.
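from re import finditer

# The wrapper name chain_lengths is only illustrative; the original snippet
# was the body of an unnamed function.
def chain_lengths(substring, string):
    # (start, end) position of every non-overlapping match
    spans = [i.span() for i in finditer(substring, string)]
    lengths = []
    runninglength = 1
    for i in range(len(spans)):
        if i == len(spans) - 1:
            # the last match always closes the current chain
            lengths.append(runninglength)
        elif spans[i][1] == spans[i + 1][0]:
            # the next match starts where this one ends: same chain
            runninglength += 1
        else:
            lengths.append(runninglength)
            runninglength = 1
    return tuple(lengths)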
Is there a faster, less confusing way to accomplish this?
You could use re.findall with a pattern that matches one or more consecutive copies of the substring, so that each match is a whole run. Dividing the length of each captured run by the length of the search string then gives the number of consecutive repeats. For example:
import re
s = 'ATTATTATT GGG ATTATT GGG ATT'
sub = 'ATT'
sl = len(sub)
regex = re.compile(f'((?:{sub})+)')
lens = [len(m) // sl for m in regex.findall(s)]
print(lens)
Output:
[3, 2, 1]
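If the substring might contain regex metacharacters (for example . or +), it probably needs to go through re.escape before being interpolated into the pattern. Below is a minimal sketch of that, wrapped in a function, together with a variant of the grouping idea from the question that uses the standard library's itertools.groupby instead of more_itertools. The function names run_lengths_regex and run_lengths_groupby are made up for illustration, and both assume the substring is a literal, non-empty string.

```python
import re
from itertools import groupby


def run_lengths_regex(sub, s):
    """Lengths of consecutive runs of `sub` in `s`, via one regex pass."""
    if not sub:
        raise ValueError("sub must be non-empty")
    # re.escape keeps characters such as '.' or '+' in `sub` literal.
    pattern = re.compile(f"(?:{re.escape(sub)})+")
    return [len(m.group()) // len(sub) for m in pattern.finditer(s)]


def run_lengths_groupby(sub, s):
    """Same result, obtained by grouping the match start positions."""
    if not sub:
        raise ValueError("sub must be non-empty")
    starts = [m.start() for m in re.finditer(re.escape(sub), s)]
    # Inside one run the k-th match starts len(sub) after the previous one,
    # so start - k * len(sub) stays constant within a run and changes
    # as soon as there is a gap between two matches.
    keyed = groupby(enumerate(starts), key=lambda p: p[1] - p[0] * len(sub))
    return [sum(1 for _ in group) for _, group in keyed]


s = 'ATTATTATT GGG ATTATT GGG ATT'
print(run_lengths_regex('ATT', s))    # [3, 2, 1]
print(run_lengths_groupby('ATT', s))  # [3, 2, 1]
```

The groupby version is the answer to "how to adjust for the substring being longer than length 1": subtracting index * len(sub) from each start position plays the same role that subtracting the index plays for runs of consecutive integers.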