Tags: regex, python-3.x, regex-group

Are long regular expressions worse than short ones?


I was trying to learn about regular expressions for a project where I want to create a TextMate grammar. Regexes seem relatively simple, but they are really hard for me to read, so I tried to create a utility module that could generate them. It mostly works as intended and generates regular expressions that actually work, all aliased by easy-to-understand names.

For example:

struc_enum = OrGroup("struct", "enum")
whitespace = TAB_SPACE.at_least(1)

results in:

(?:struct|enum)
[ \t]+

In this case, there's not much benefit in using Python aliases, but then I can do:

valid_name = r"\b" + Group(ALPHA, ALPHANUMERIC.repeated())
struc_enum = OrGroup("struct", "enum")
typed_name = (struc_enum + whitespace).optional() + valid_name + whitespace + valid_name.captured()

and this is what print(typed_name) displays:

(?:(?:(?:struct|enum)[ \t]+)?\b[a-zA-Z][a-zA-Z\d]*[ \t]+(\b[a-zA-Z][a-zA-Z\d]*))
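
To make this concrete, here is a minimal sketch of what such a builder can look like. This is a simplified illustration rather than my exact module (the Pattern class and method bodies here are guesses reconstructed from the outputs above; the real module evidently also wraps the final result in an outer (?:...)):

class Pattern:
    """A regex fragment that knows how to compose itself with others."""

    def __init__(self, source):
        self.source = source

    def __add__(self, other):
        # Concatenation: pattern + pattern, or pattern + raw string.
        other = other.source if isinstance(other, Pattern) else other
        return Pattern(self.source + other)

    def __radd__(self, other):
        # Allows raw-string prefixes such as r"\b" + pattern.
        return Pattern(other + self.source)

    def optional(self):
        return Pattern(f"(?:{self.source})?")

    def repeated(self):
        # Naive: assumes the fragment is safe to quantify directly.
        return Pattern(f"{self.source}*")

    def at_least(self, n):
        return Pattern(f"{self.source}+" if n == 1 else f"{self.source}{{{n},}}")

    def captured(self):
        return Pattern(f"({self.source})")

    def __str__(self):
        return self.source


def Group(*parts):
    # Plain concatenation of sub-patterns.
    return Pattern("".join(str(p) for p in parts))


def OrGroup(*alternatives):
    # Non-capturing alternation: (?:a|b).
    return Pattern("(?:" + "|".join(str(a) for a in alternatives) + ")")


TAB_SPACE = Pattern(r"[ \t]")
ALPHA = Pattern(r"[a-zA-Z]")
ALPHANUMERIC = Pattern(r"[a-zA-Z\d]")

print(OrGroup("struct", "enum"))  # (?:struct|enum)
print(TAB_SPACE.at_least(1))      # [ \t]+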

This method can be used to create small snippets and concatenate them into more complex patterns, but with each level of concatenation the expression grows very quickly, to the point that I could easily end up with this:

(?:(func)[\s]+([a-zA-Z_]+[a-zA-Z\d_]*)[\s]*\([\s]*(?:[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?(?:[\s]*,[\s]*[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?)*[\s]*)?\))

In an Atom grammar, the big regex above can match lines like the following, but it doesn't seem to work elsewhere:

func myfunc(asd asd*, asd*, asdasd)
func do_foo01(type arg1, int arg2)
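
For reference, Python's re (which the module is written in) accepts the pattern verbatim ({,2} is read as {0,2} there) and matches both sample lines; a quick check:

import re

pattern = re.compile(
    r"(?:(func)[\s]+([a-zA-Z_]+[a-zA-Z\d_]*)[\s]*\([\s]*"
    r"(?:[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?"
    r"(?:[\s]*,[\s]*[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?)*"
    r"[\s]*)?\))"
)

for line in ("func myfunc(asd asd*, asd*, asdasd)",
             "func do_foo01(type arg1, int arg2)"):
    match = pattern.fullmatch(line)
    print(match.group(1), match.group(2))  # -> func myfunc / func do_foo01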

With enough patience, a human could construct an equivalent but probably much shorter expression, which raises the question: are big regular expressions better or worse than equivalent shorter ones in terms of computational overhead? At what point can we consider a regex too big?


Solution

  • I think this is a fine idea, but you need to be clear with yourself about the scale of the project you're undertaking.

    We almost never strictly need regular expressions; we could take every string apart and write our own parsing operations with startswith, if statements, and so on. But regex syntax is a mature, powerful system that lets us succinctly express certain kinds of logic.

    Regexes are often hard to read. There are some tools that can help, but the idea of a less succinct system for doing the things we currently do with regexes is sound. The hard part will be replicating the breadth, power, and reliability of existing regex engines.

    My guess is you'll be best served by learning to tolerate the density of regular expressions. We might possibly be better served by you building an easier-to-read system for string parsing, but you'd have about 20 years of catching up to do.

    Regarding performance: regexes are (or can be) compiled, and depending on the context this can bring a big performance benefit. In any case, as with any sufficiently powerful language, the length of an instruction is a poor indicator of its run-time complexity; see the sketch below.
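
    As a rough illustration: the short pattern below is a classic catastrophic-backtracking example, while the long one is the typed_name pattern from the question. Length and speed point in opposite directions here.

    import re
    import time

    # Seven characters, but the nested quantifiers backtrack
    # exponentially on an input that almost matches: each extra
    # 'a' roughly doubles the time it takes to fail.
    short = re.compile(r"(a+)+$")

    # Ten times longer, but it matches in a single left-to-right pass.
    long_ = re.compile(
        r"(?:(?:struct|enum)[ \t]+)?"
        r"\b[a-zA-Z][a-zA-Z\d]*[ \t]+(\b[a-zA-Z][a-zA-Z\d]*)"
    )

    start = time.perf_counter()
    short.match("a" * 25 + "b")
    print(f"short pattern: {time.perf_counter() - start:.3f}s")   # seconds

    start = time.perf_counter()
    long_.match("struct foo bar")
    print(f"long pattern:  {time.perf_counter() - start:.6f}s")   # microseconds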