
How to understand the flex C/C++ string literal regex?


I'm learning flex/bison to get into parsing. The book flex & bison shows the following flex example:


UCN (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})

%{
...
%}

%%

\"([^\"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\[Xx][0-9a-fA-F]+|{UCN})+\" { ... save token here }

%%

I don't understand these parts of the regex:

  1. Does [^\"\\] mean "do not match \" or \\"? If so, why does that need to be specified, since \" and \\ don't seem to appear in the other groups?
  2. What does \\[0-7]{1,3} mean?
  3. What does \\[Xx][0-9a-fA-F]+ mean?
  4. Does UCN mean UTF-8?

Solution

  • That regular expression matches the following:

    • A " character,
    • Followed by any combination of one or more of the following:
      • [^\"\\] - Any character other than " or \
      • \\['"?\\abfnrtv] - A \ followed by any of ', ", ?, \, a, b, f, n, r, t, or v.
      • \\[0-7]{1,3} - A \ followed by one to three octal digits.
      • \\[Xx][0-9a-fA-F]+ - A \ followed by X or x followed by one or more hexadecimal digits.
      • {UCN}, which expands to (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}) - Either of the following:
        • \\u[0-9a-fA-F]{4} - A \ followed by u followed by four hexadecimal digits
        • \\U[0-9a-fA-F]{8} - A \ followed by U followed by eight hexadecimal digits
    • Followed by a closing " character
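As a rough check of the breakdown above, the pattern can be transcribed into a std::regex (ECMAScript syntax) and tried against a few inputs. This is only a sketch under some assumptions: the names matches_book_pattern and check_book_pattern are made up for illustration, {UCN} is expanded by hand, flex's \" is written as a plain ", and std::regex is not flex, so scanner behavior such as longest-match tokenizing is not modeled.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` is one string literal
// according to the book's pattern (transcribed, not generated by flex).
inline bool matches_book_pattern(const std::string& text) {
    static const std::regex re(
        R"re("([^"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\[Xx][0-9a-fA-F]+|\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})+")re");
    return std::regex_match(text, re);
}

// Usage sketch: each input is written with C++ escapes, so "\\n" below is
// the two characters backslash and n, as they would appear in source text.
inline void check_book_pattern() {
    assert(matches_book_pattern("\"hello\""));       // ordinary characters
    assert(matches_book_pattern("\"a\\n\\t\""));     // simple escapes
    assert(matches_book_pattern("\"\\x41\\101\""));  // hex and octal escapes
    assert(matches_book_pattern("\"\\u00e9\""));     // universal character name
    assert(!matches_book_pattern("\"\\q\""));        // backslash q is not a listed escape
    assert(!matches_book_pattern("\"a\"b\""));       // unescaped quote ends the literal early
}
```

The last two checks show why " and \ are excluded from the generic character class: a backslash has to start one of the listed escapes, and a bare quote can only be the closing one.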

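    Note that this isn't actually a correct pattern for matching all C++ string literals, because:

    • It doesn't match the empty string (""): the trailing + requires at least one character or escape sequence between the quotes. Using * instead would allow it.
    • In C and C++, hex escape sequences must begin with a lower-case x, so \X41 is not a valid hex escape. A better pattern for matching those would be \\x[0-9a-fA-F]+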
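A minimal sketch of those two adjustments, again transcribed to std::regex rather than run through flex: the trailing + becomes * so that "" is accepted, and the hex alternative only allows a lower-case x. The names matches_fixed_pattern and check_fixed_pattern are hypothetical, and this is not claimed to cover every C++ string literal form (prefixes and raw strings, for example, are not handled).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: the book's pattern with '*' in place of '+' and
// a lower-case-only hex escape.
inline bool matches_fixed_pattern(const std::string& text) {
    static const std::regex re(
        R"re("([^"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\x[0-9a-fA-F]+|\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})*")re");
    return std::regex_match(text, re);
}

// Usage sketch: inputs are written with C++ escapes, so "\\x41" is the
// four characters backslash, x, 4, 1.
inline void check_fixed_pattern() {
    assert(matches_fixed_pattern("\"\""));         // empty literal now matches
    assert(matches_fixed_pattern("\"\\x41\""));    // lower-case hex escape
    assert(!matches_fixed_pattern("\"\\X41\""));   // upper-case X is rejected
    assert(matches_fixed_pattern("\"hi\\n\""));    // everything else is unchanged
}
```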

    For more information about what each of the C++ escape sequences means, see a reference on C++ escape sequences.

    To answer your specific questions:

    1. Yes, [^\"\\] matches any single character other than " or \. Those two are excluded because they have special meaning inside a string literal: \ begins an escape sequence, which the other alternatives handle, and an unescaped " ends the literal. Leaving them out of the generic "any character" alternative forces them to be matched by the other parts of the expression.
    2. Answered above: \\[0-7]{1,3} means a \ followed by one to three octal digits.
    3. Answered above: \\[Xx][0-9a-fA-F]+ means a \ followed by X or x followed by one or more hexadecimal digits
    4. UCN is short for Universal Character Name. It denotes a Unicode character, but doesn't say anything about its encoding.
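To illustrate the last point, the same UCN can end up as a different number of code units depending on which kind of string literal it appears in. The sketch below counts code units at compile time; code_units is a hypothetical helper introduced only for this example.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E9 written as the same UCN in three differently encoded literals.
constexpr std::size_t e_acute_utf8  = code_units(u8"\u00e9");  // UTF-8
constexpr std::size_t e_acute_utf16 = code_units(u"\u00e9");   // UTF-16
constexpr std::size_t e_acute_utf32 = code_units(U"\u00e9");   // UTF-32

// U+1F600 lies outside the Basic Multilingual Plane, so it needs the
// eight-digit \U form.
constexpr std::size_t face_utf8  = code_units(u8"\U0001F600");
constexpr std::size_t face_utf16 = code_units(u"\U0001F600");
constexpr std::size_t face_utf32 = code_units(U"\U0001F600");

static_assert(e_acute_utf8 == 2 && e_acute_utf16 == 1 && e_acute_utf32 == 1,
              "U+00E9: two UTF-8 bytes, one UTF-16 unit, one UTF-32 unit");
static_assert(face_utf8 == 4 && face_utf16 == 2 && face_utf32 == 1,
              "U+1F600: four UTF-8 bytes, a surrogate pair, one UTF-32 unit");
```

The UCN names the code point; the literal prefix (u8, u, U) is what selects the encoding.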