Search code examples
regexregular-language

What is the power of regular expressions?


As the name suggests we may think that regular expressions can match regular languages only. But regular expressions we use in practice contain stuff that I am not sure it's possible to implement with their theoretical counterparts. How for example would you simulate a back-reference? So the question arises: what is the theoretical power of the regular expressions we use in practice? Can you think of a way to match {(a^n)(b^n)|n>=0}? What about {(a^n)(b^n)(c^n)|n>=0}?


Solution

  • The answer to your question is, "regular expression" languages that allow back-references are neither regular nor context-free. (In other words, as you pointed out, you cannot simulate back-reference with a regular language, nor with a CFL.) In fact, Wikipedia says many of the "regular expression" languages we use in practice are NP-Complete:

    Pattern matching with an unbounded number of back references, as supported by numerous modern tools, is NP-complete (see,[11] Theorem 6.2).

    As others have suggested, the regular expression languages commonly supported in computer languages and libraries are a different animal from regular expressions in formal language theory. Larry Wall wrote in regard to Perl "regexes,"

    'Regular expressions' [...] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes"

    You asked,

    Can you think of a way to match {(a^n)(b^n)|n>=0}? What about {(a^n)(b^n)(c^n)|n>=0}?

    I'm not sure here if you're trying to test whether theoretical regular expression languages can match the "language of squares", or whether you're looking for an implementation in a (practical) regex language. Here's the proof why the former is not possible; and here's a long explanation and implementation of the latter for java regexes.