
re2c: syntax error when trying to match string


I'm trying to use re2c, but it gives me a syntax error on this regex:

(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1

What's wrong with it? It should match a double-quoted or single-quoted string.


Solution

  • Re2c, like most scanner generators, only implements regular expression primitives that can be matched in linear time without backtracking. As a consequence, it does not implement back-references, captures (although you can insert tag markers into the regular expression), or non-greedy matches. (Technically, non-greedy matches can be implemented in linear time, but getting it right is a bit tricky.)

    The back-reference in your regex is really just an abbreviation, since it can only take on two values: either it is ["] or [']. Separating out the two alternatives also makes it easy to avoid the need for the non-greedy match:

    1. If you don't allow newlines in strings:

      ["]([^"\\\n]|\\.)*["]|[']([^'\\\n]|\\.)*[']
      
    2. If you allow newlines in strings only if they are escaped:

      ["]([^"\\\n]|\\(.|\n))*["]|[']([^'\\\n]|\\(.|\n))*[']
      
    3. If you allow newlines anywhere in strings:

      ["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*[']
      

    (Like flex, re2c considers . to match any character other than a newline, while negative character classes do include newlines unless specifically mentioned. So newline handling often needs to be explicit.)
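As a rough way to check how the three alternatives above differ, here is a minimal sketch using Python's `re` module rather than re2c itself. It rests on the assumption that Python agrees with re2c on the points that matter here: `.` excludes newline by default, negated character classes include newline, and `["]` and `[']` are ordinary one-character classes. The constant names and the `is_string_literal` helper are hypothetical, chosen only for this illustration.

```python
import re

# The three patterns from the list above, copied unchanged into raw strings.
NO_NEWLINES = r"""["]([^"\\\n]|\\.)*["]|[']([^'\\\n]|\\.)*[']"""
ESCAPED_NEWLINES = r"""["]([^"\\\n]|\\(.|\n))*["]|[']([^'\\\n]|\\(.|\n))*[']"""
ANY_NEWLINES = r"""["]([^"\\]|\\(.|\n))*["]|[']([^'\\]|\\(.|\n))*[']"""


def is_string_literal(pattern, text):
    """Return True if the whole of `text` is one quoted string under `pattern`."""
    return re.fullmatch(pattern, text) is not None


# A few sample inputs: plain, escaped quote, raw newline, escaped newline.
samples = ['"abc"', "'it\\'s'", '"a\nb"', '"a\\\nb"']
for sample in samples:
    results = [
        is_string_literal(p, sample)
        for p in (NO_NEWLINES, ESCAPED_NEWLINES, ANY_NEWLINES)
    ]
    print(repr(sample), results)
```

Because the two branches inside each repeated group start with different characters (a backslash versus anything that is not a backslash), the match never has to choose between them, which is why this form suits a scanner that does not backtrack.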

    Note that re2c does implement quoted literal strings (like flex), so " and ' are metacharacters, a feature which appears in very few regex libraries. (Unlike flex, re2c also accepts single-quoted strings and treats them as case-insensitive. In flex, a single quote is not a metacharacter.) The consequence is that you must escape them to make them literal characters; the usual convention is to put them inside a character class, as above, rather than using backslash escapes (the "falling timber" representation), which can be hard to read.

    Please read the documentation for re2c patterns, which differ in some significant ways from the regex libraries you may be used to (or for which you have found examples in a web search).