Search code examples
cregexposix

Do extended regexes support back-references?


Wikipedia says that extended regexes "dropped support for backreferences", so the "basic" regex mode has to be used to enable those. However, it seems that a number of implementations do support backreferences for extended regexes. For example, with gcc 4.6 on Ubuntu Precise, they are supported. FreeBSD implementations seem to support them only in basic mode.

Boost says (and seems to agree with Wikipedia) that backreferences are not supported for extended regexes, but Boost::Regex adds them as an extension.

Is this just a poorly defined part of the standard which is interpreted differently by every implementation?


Solution

  • As others have already pointed out, it's quite clear that POSIX EREs do not support back-references.

    The rationale given in the OpenGroup Base Specifications Issue 7 for not adding back-references to EREs is given as:

    It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.

    Quoted from: Rationale: Base Definitions: Extended Regular Expressions

    The primary reason for this limitation is to allow POSIX EREs to be converted to a deterministic finite automata (DFA), and indeed the original implementation of EREs in Unix was done as a DFA. The use of a DFA allows guarantees to be made about the performance of the implementation. Pattern matching with (an unbounded number of) back references is an NP-hard problem, and perhaps even an NP-complete problem. Consensus in the POSIX standards committee could never have been reached if back-references were proposed for EREs because that would force all the companies using the original Unix implementation to change their code to a non-deterministic implementation and to drop their performance guarantees, and some of those companies had members on the committee.

    It has also been noted that back-references in REs are not intuitive for either users or implementors, and indeed they've caused extreme confusion more often than now. See for example the examples given in RE-Interpretation: The Dark Corners, specifically the text:

    RE#46 through RE#82 are nasty; backreferences are intuitive neither for the implementor nor the user.

    NOTE: back-references in REs are not the same as references to sub-patterns in substitution text in tools such as sed.