Flex regular expression for strings with either single or double quotes


I am writing a regular expression for a lexical analyzer for class. I currently have a regular expression for ordinary double-quoted strings; however, my professor wants us to account for single quotes, too.

Here is my current regular expression:

\"([^\"\\]|\\.)*\"

I am unsure as to how to make it accept both kinds.

Thank you in advance!


Solution

  • (F)lex really doesn't have any mechanism which will accept two different kinds of quotes, other than putting the two patterns together with |. Usually, it's more readable to just write multiple patterns:

    ["]([^"\\\n]|\\(.|\n))*["]   { /* A double-quoted string with escapes and splices */ }
    [']([^'\\\n]|\\(.|\n))*[']   { /* A single-quoted string with escapes and splices */ }
    

    (The main difference between the above and your pattern is that it follows C rules: it doesn't allow newlines in quoted strings, but it does allow "splices": line continuations consisting of a backslash followed by a newline. If your language doesn't have those, you should go back to your original formulation, but continuation lines are pretty common in programming languages. What you need to remember always is that . does not match a newline, while [^...] does unless newline is specifically excluded.)
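
    If you would rather share one action between both kinds of quotes, here is a sketch that uses flex's special | action (meaning "same action as the next rule") instead of one long alternation; the matching behaviour should be the same. STRING is a hypothetical token name, not something taken from your grammar:

    ["]([^"\\\n]|\\(.|\n))*["]   |
    [']([^'\\\n]|\\(.|\n))*[']   { return STRING; /* STRING is an assumed token name */ }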

    That's not really sufficient though, because it won't match unterminated quoted literals. If the lexer sees a quote and then tries to match an unterminated literal, the match will fail at the end of the line (or at the end of the first non-spliced line), and the lexer will back up and match only the opening quote. If you follow the usual technique of a fallback pattern which returns the single character as its own token:

    .|\n       { return *yytext; }
    

    then the unterminated strings will be sent to your parser as token characters which the parser isn't expecting; that will make it hard to produce a meaningful error message and impossible to do any kind of error recovery.

    It's usually best to add fallback unterminated string patterns (which are the same as the correct patterns except that they are missing the terminating quote) in order to:

    • avoid the backtrack, and
    • correctly detect the error.

    Just a suggestion.
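
    A minimal sketch of what those fallback rules might look like, placed alongside the correct patterns. STRING and BAD_STRING are hypothetical token names, and the error message is only illustrative:

    ["]([^"\\\n]|\\(.|\n))*["]   { return STRING; }
    [']([^'\\\n]|\\(.|\n))*[']   { return STRING; }
    ["]([^"\\\n]|\\(.|\n))*      { fprintf(stderr, "unterminated string: %s\n", yytext); return BAD_STRING; /* assumed token */ }
    [']([^'\\\n]|\\(.|\n))*      { fprintf(stderr, "unterminated string: %s\n", yytext); return BAD_STRING; /* assumed token */ }

    Because flex prefers the longest match, a properly terminated string still matches one of the first two rules (the closing quote makes that match one character longer). The last two rules only win when the closing quote is missing.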