I am writing a C++ program that is a lexical analyzer for a C++-like language. To find each token, I match the input against a regex and then decide which token it is.
Strings in this language are exactly like C++ strings. The regex that I use is:
\"([^\\\"]|\\.)?\"
But it is not quite correct. For an input like this:
"String \" int"
The output should be one string token, but with my regex I get one string token ("String ") and an int keyword, and then an error.
Do you have any idea how to handle this? Or how should I change the regex?
P.S. : I use regex_search() to find the match.
Thank you.
You may use
std::regex rx(R"(\"[^\"\\]*(?:\\.[^\"\\]*)*\")");
The pattern is "[^"\\]*(?:\\.[^"\\]*)*":

- " - a double quote
- [^"\\]* - zero or more chars other than a double quote and backslash
- (?:\\.[^"\\]*)* - zero or more repetitions of
  - \\. - any char with a backslash in front (replace . with [\s\S] if you need to also support escaped line breaks)
  - [^"\\]* - zero or more chars other than a double quote and backslash
- " - a double quote.

See the regex demo.
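As a rough illustration of how the pattern above might be wired into a lexer, here is a minimal sketch. The helper name match_string_literal is hypothetical (it is not from the question), and the code assumes C++17 for std::optional. It passes std::regex_constants::match_continuous to regex_search() so that a match is only accepted if it starts exactly at the current position, which is usually what a tokenizer wants; whether that fits your lexer's main loop is an assumption.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper for illustration: tries to match a string literal that
// starts exactly at offset `pos` of `src`. Returns the matched text
// (including the surrounding quotes), or std::nullopt if none starts there.
std::optional<std::string> match_string_literal(const std::string& src,
                                                std::size_t pos = 0) {
    // Same pattern as in the answer; the raw string literal avoids having to
    // escape every backslash a second time for the C++ compiler.
    static const std::regex rx(R"(\"[^\"\\]*(?:\\.[^\"\\]*)*\")");

    if (pos > src.size()) {
        return std::nullopt;
    }

    std::smatch m;
    const auto first =
        src.begin() + static_cast<std::string::difference_type>(pos);

    // match_continuous requires the match to begin at `first`, so
    // regex_search() cannot skip ahead to a quote later in the input.
    if (std::regex_search(first, src.end(), m, rx,
                          std::regex_constants::match_continuous)) {
        return m.str();
    }
    return std::nullopt;
}
```

Called on the input from the question, match_string_literal(R"("String \" int")") should return the whole literal as one token, while an input that does not start with a quote at the given offset, or that has no closing quote, yields std::nullopt.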