Search code examples
pythonregexalgorithmparsingmarkup

Regex/matching engine with completely overlapping results and "cursor" manipulation


I'm looking for something somewhat similar to a regex engine but which allows for completely overlapping results and allows manipulation of the internal "cursor" while the engine is returning matches.

The normal regex way:

Say you have a normal regex with various "alternatives": item1|item2|item3, and you use either findall or finditer to get all matches. At a certain position in the input string, the engine may match any of those alternatives. Once found, the cursor is advanced to the index right after the end of the match and it continues to look for any of the alternatives. Even if two or more of those alternatives might have matched the string at the initial cursor's position, only one is returned:

import re
input_string = 'foobar'
compiled = re.compile('foobar|foo|bar')
compiled.findall(input_string)

# 'foobar'  

What I want (1):

I want them all to be returned. Like so:

import muchneededregexthing
input_string = 'foobar'
compiled = muchneededregexthing.compile('foobar|foo|bar')
searcher = compiled.create_searcher(input_string)
while not searcher.reached_end():
    match = searcher.search() # increments searcher's internal cursor 
                              # to after the end of the match
    print(match.string, match.span())

# foobar (0,6)
# foo (0,3)
# bar (3,6)

What I also want (2):

I want to be able to modify the searcher's cursor, so that I can manipulate the results according to what happens at runtime ('foobar' is matched, and 'foo' and 'bar' separately don't matter).

import muchneededregexthing
input_string = 'foobarkitten'
compiled = muchneededregexthing.compile('foobar|foo|bar|kitten')
searcher = compiled.create_searcher(input_string)
while not searcher.reached_end():
    match = searcher.search() # increments searcher's internal cursor 
                              # to after the end of the match
    print(match.string, match.span())
    if match.string == 'foobar':
        searcher.advance_cursor(match.end())

# foobar (0,6)
# kitten (6,12)

What I cannot use (most probably):

  • str.find: I need the regex thing for a markup (markdown/wiki/etc.) parser. At any time, many different elements may have to be looked for. Using str.find I'd need to search the whole document with every element that the markup supports. The language is too complex to subdivide the document into chunks like a "header chunk", "paragraph chunk", etc.

  • re: Unless re magically supports what I need, I cannot use it for the following reason: regexes are limited, you can't match anything you like. Yet their properties are useful for cases like mine. I plan on matching in two stages: the regex provides a possible match for an element. A smart function/method is then consulted on whether the possible match is a good match. If so, great, advance cursor. If not, screw the foobar and give foo and bar a chance on their own.

Ideas are very welcome. I'm fairly certain I need said features. My best idea so far is to write my own muchneededregexthing module with support for most of the regex syntax in C/C++, so don't fear that I will disregard your idea as far-fetched.

Edit 1: request for example of markup:

Markup elements, and therefore the tokens that need to be matched, are defined and brought in by means of plugins. Therefore, the framework does not contain any markup in itself. I could simply match a plugin's token with a regex and be done with it, but I want to at least explore the options and try to allow a greater range of markup tokens than what one would be able to support with a regex. For example, how would one match string:number if their relation would be that the number should be some numerical representation of the string? a:0 is a valid token, but a:1 is not. b:1 is, however, and so is bc:28 (126 + 21).

For this example, the plugin could provide a regex such as ([a-z]{1,5})([0-9]{1,5}). The algorithm would then pass a match to a special function that calculates the numerical value of the first group and compares that to the value of the second group. If those values match, than plugin will handle this part of the document. If not, it returns and an attempt is made to have another plugin handle this index in the document.


Solution

  • For those future wanderers out there who seek answers to this question: I came across hyperscan recently which does exactly what I was looking for. It's in C, but there is a Python port for it (I didn't test the port, but the native C version works very well).

    While most regex engines will match only one subpattern on a given offset in the input string, Hyperscan will report all subpatterns that matched for each offset. See also hyperscan semantics.