I'm learning flex/bison to study parsing. The book flex & bison shows the following flex example:
UCN (\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})
%{
...
%}
%%
\"([^\"\\]|\\['"?\\abfnrtv]|\\[0-7]{1,3}|\\[Xx][0-9a-fA-F]+|{UCN})+\" { ... save token here }
%%
I don't understand these parts of the regex:

- Does `[^\"\\]` mean "do not match `\"` or `\\`"? If so, why does the pattern need to exclude them, since `\"` and `\\` do not seem to appear in the other groups?
- What does `\\[0-7]{1,3}` mean?
- What does `\\[Xx][0-9a-fA-F]` mean?
- Does `UCN` mean UTF-8?

That regular expression matches the following:
- A `"` character
- One or more of the following:
  - `[^\"\\]` - any character other than `"` or `\`
  - `\\['"?\\abfnrtv]` - a `\` followed by any of `'`, `"`, `?`, `\`, `a`, `b`, `f`, `n`, `r`, `t`, or `v`
  - `\\[0-7]{1,3}` - a `\` followed by one to three octal digits
  - `\\[Xx][0-9a-fA-F]+` - a `\` followed by `X` or `x` followed by one or more hexadecimal digits
  - `{UCN}`, which expands to `(\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})` - either of the following:
    - `\\u[0-9a-fA-F]{4}` - a `\` followed by `u` followed by four hexadecimal digits
    - `\\U[0-9a-fA-F]{8}` - a `\` followed by `U` followed by eight hexadecimal digits
- A `"` character

Note that this isn't actually a correct pattern for matching all C++ string literals because
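it cannot match the empty string literal (`""`), since the `+` requires at least one character or escape between the quotes, and it accepts hexadecimal escapes that start with an uppercase `X` (`\X`), whereas C++ only allows a lowercase `x`. A better pattern for matching those would be `\\x[0-9a-fA-F]+`.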
For more info about what all of the C++ escape sequences mean, see this page.
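
As a rough way to check the breakdown above, the pattern can be transcribed into Python's `re` module and tried against a few sample literals. This is only a sketch: it assumes Python's regex syntax is close enough to flex's for this pattern, `{UCN}` is expanded by hand, and `build_pattern`, `matches`, and `STRICTER_STRING` are hypothetical names that do not come from the book.

```python
import re

# {UCN} expanded by hand, since Python's re has no named definitions.
UCN = r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}'


def build_pattern(hex_prefix, repeat):
    """Assemble the string-literal regex from its alternatives."""
    alternatives = [
        r'[^"\\]',                             # any character except " and \
        r"""\\['"?\\abfnrtv]""",               # simple escape such as \n or \"
        r'\\[0-7]{1,3}',                       # octal escape, 1 to 3 digits
        r'\\' + hex_prefix + r'[0-9a-fA-F]+',  # hexadecimal escape
        UCN,                                   # universal character name
    ]
    return re.compile('"(?:' + '|'.join(alternatives) + ')' + repeat + '"')


# The pattern as printed in the book: \X or \x, and at least one element.
BOOK_STRING = build_pattern('[Xx]', '+')

# Hypothetical stricter variant (an assumption, not from the book):
# lowercase \x only, and '*' so that the empty literal is accepted.
STRICTER_STRING = build_pattern('x', '*')


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [r'"hello"', r'"a\"b"', r'"\101\x41\u00e9"',
               r'"a\qb"', '""', r'"\X41"']
    for sample in samples:
        print(sample, matches(BOOK_STRING, sample),
              matches(STRICTER_STRING, sample))
```

With the book's pattern, `"a\qb"` and `""` are rejected while `"\X41"` is accepted; the stricter variant accepts `""` and rejects `"\X41"`.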
To answer your specific questions:
- `\` denotes an escape sequence, which is handled by the other options, and an un-escaped `"` means the end of the string literal. The generic "any character" match doesn't match either of those characters so that they can only be matched by the other parts of the expression.
- `\\[0-7]{1,3}` means a `\` followed by one to three octal digits.
- `\\[Xx][0-9a-fA-F]+` means a `\` followed by `X` or `x` followed by one or more hexadecimal digits.
- `UCN` is short for Universal Character Name. It denotes a Unicode character, but doesn't say anything about its encoding.
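
To make the last point concrete, here is a small Python sketch of how the numeric escapes name a value, and how one Universal Character Name can end up as different bytes depending on the encoding chosen afterwards. The helper `escape_value` is a hypothetical name, and it assumes its argument has already been matched by one of the octal, hexadecimal, or UCN alternatives.

```python
def escape_value(escape):
    """Return the number named by an octal, hexadecimal, or UCN escape."""
    body = escape[1:]             # drop the leading backslash
    if body[0] in 'xXuU':
        return int(body[1:], 16)  # hexadecimal digits after x, X, u, or U
    return int(body, 8)           # otherwise one to three octal digits


# Octal and hexadecimal escapes can name the same value.
print(escape_value(r'\101'))      # 65
print(escape_value(r'\x41'))      # 65

# A UCN names a code point; the bytes depend on the encoding picked later.
ch = chr(escape_value(r'\u00e9'))  # U+00E9
print(ch.encode('utf-8'))          # b'\xc3\xa9'
print(ch.encode('utf-16-le'))      # b'\xe9\x00'
```

The same code point yields two bytes in both encodings here, but different ones, which is why `UCN` by itself does not imply UTF-8.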