How to express a C++-like string literal with a regex (lexical analysis)


I am writing a C++ program that is a lexical analyzer for a C++-like language. To find each token, I match the input against a regex and then decide which token it is.

String literals in this language work exactly as in C++. The regex I use is:

\"([^\\\"]|\\.)?\"

But it is not really correct. For an input like this:

"String \" int"

The output should be one string token, but with my regex I get one string token ("String ") and an int keyword, and then an error.

Do you have any idea how to handle this? Or how should I change the regex?

P.S.: I use regex_search() to find the match.

Thank you.


Solution

  • You may use

    std::regex rx(R"(\"[^\"\\]*(?:\\.[^\"\\]*)*\")");
    

    The pattern is "[^"\\]*(?:\\.[^"\\]*)*":

    • " - a double quote
    • [^"\\]* - zero or more chars other than a double quote and backslash
    • (?:\\.[^"\\]*)* - zero or more repetitions of
      • \\. - any char with a backslash in front (replace . with [\s\S] if you also need to support escaped line breaks)
      • [^"\\]* - zero or more chars other than a double quote and backslash
    • " - double quote.

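Since the question mentions regex_search(), here is a minimal sketch of how the pattern could be applied at a given position in the input. The helper name match_string_literal and its return convention are assumptions made for illustration, not part of the original post. It passes std::regex_constants::match_continuous so that the match must begin exactly at the current position; without it, regex_search() is free to skip ahead and start at a later quote.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (an assumption, not from the original post): returns the
// length of the string literal that starts exactly at `pos`, or 0 if no
// complete literal starts there.
std::size_t match_string_literal(const std::string& input, std::size_t pos) {
    assert(pos <= input.size());  // precondition: pos is inside the input

    // Same pattern as in the answer above.
    static const std::regex rx(R"(\"[^\"\\]*(?:\\.[^\"\\]*)*\")");

    std::smatch m;
    // match_continuous requires the match to begin at the first iterator,
    // so the search cannot jump forward to a quote further along the input.
    const bool found = std::regex_search(
        input.begin() + static_cast<std::ptrdiff_t>(pos), input.end(), m, rx,
        std::regex_constants::match_continuous);

    return found ? static_cast<std::size_t>(m.length(0)) : 0;
}
```

In a lexer loop, a non-zero result would be consumed as one string token and the cursor advanced by that many characters; a result of 0 means the other token patterns should be tried at the same position.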