python, python-itertools

Find consecutive characters in a string + their start and end indices (python)


I will be working with strings approximately 365 characters in length. Within these strings, I want to find consecutive runs of the character '-' as well as the start and end index of each consecutive run. This should include instances where the character occurs only once.

Consider the following string: 'a---b-cccc-----'. I would like to know that there is a consecutive run of three '-' characters, then one occurrence, then another five '-' characters in a row. I would also like to know their start and end positions. Reporting the results in a list of tuples (start, end, number of consecutive) would be fine, e.g.:

[(1, 3, 3), (5, 5, 1), (10, 14, 5)]

I've considered using itertools combined with enumerate. However, I can't quite get it right. I patched this together based on previous questions, but it's missing the start index:

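import itertools

s = 'a---b-cccc-----'
counts = []
count = 1
# Pair each character with its successor; the last character is paired with None.
for idx, (a, b) in enumerate(itertools.zip_longest(s, s[1:], fillvalue=None)):
    if a == b == "-":
        count += 1  # still inside a run of '-'
    elif a != b and a == "-":
        counts.append((idx, count))  # idx is the run's end index; the start is not recorded
        count = 1
print(counts)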

Output:

[(3, 3), (5, 1), (14, 5)]

I pieced together the following from other questions:

from itertools import groupby

g = groupby(enumerate(s), lambda x: x[1])
l = [(x[0], list(x[1])) for x in g if x[0] == '-']
[(x[1][0][0], x[1][-1][0], len(x[1])) for x in l]

Output:

[(1, 3, 3), (5, 5, 1), (10, 14, 5)]

It seems to work, but I don't fully understand its logic, and I'm not sure whether it will always work. Is there a better way, or is this as efficient as it gets? I will need to perform the search hundreds of thousands of times, so efficiency is key here.

Thanks!


Solution

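  • One way using re.finditer. Note that m.span() returns an exclusive end index, so each end value is one past the last '-' of its run:

    import re
    [(*m.span(), len(m.group(0))) for m in re.finditer("-+", s)]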
    

    Output:

    [(1, 4, 3), (5, 6, 1), (10, 15, 5)]
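
  • If inclusive end indices are wanted, as in the question, a minimal sketch along the following lines may help. The names DASH_RUN, dash_runs_regex and dash_runs_groupby are made up for illustration and are not part of the original answer. The pattern is compiled once on the assumption that the same search is repeated for many strings; nothing was timed here, so measure on your own data before relying on either version. The groupby version is spelled out step by step to show why the approach from the question works.

```python
import re
from itertools import groupby

# Hypothetical name: compile once, since the same search runs on many strings.
DASH_RUN = re.compile(r"-+")


def dash_runs_regex(s):
    """Return (start, inclusive_end, length) for each run of '-' in s."""
    # m.end() is exclusive, so subtract 1 to get the index of the last '-'.
    return [(m.start(), m.end() - 1, m.end() - m.start())
            for m in DASH_RUN.finditer(s)]


def dash_runs_groupby(s):
    """Same result via groupby, written out step by step."""
    runs = []
    # enumerate(s) yields (index, char) pairs; groupby batches consecutive
    # pairs that share the same char, which is used as the key.
    for char, group in groupby(enumerate(s), key=lambda pair: pair[1]):
        if char != "-":
            continue  # skip runs of any other character
        indices = [i for i, _ in group]  # positions covered by this run
        runs.append((indices[0], indices[-1], len(indices)))
    return runs


s = "a---b-cccc-----"
print(dash_runs_regex(s))    # [(1, 3, 3), (5, 5, 1), (10, 14, 5)]
print(dash_runs_groupby(s))  # [(1, 3, 3), (5, 5, 1), (10, 14, 5)]
```

    Because groupby only ever merges adjacent equal items, every '-' group is one consecutive run, including runs of length one, so the approach from the question does not depend on the particular input.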