Lex - double quotation mark inside string


I have a lex grammar that contains rules for double-quoted strings:

...
%x DOUBLEQUOTE
...
%%
"\""                { yylval->string = NULL; BEGIN(DOUBLEQUOTE); }
<DOUBLEQUOTE> {
    "\n"            {
                        /* reset column counter on new line */
                        PARSER->linepos = 0;
                        (PARSER->linenum)++;
                        expr_parser_append_string(PARSER, &(yylval->string), yytext);
                    }
    [^\"\n]+        { expr_parser_append_string(PARSER, &(yylval->string), yytext); }
    "\\\""          { expr_parser_append_string(PARSER, &(yylval->string), yytext); }
    "\""            {
                        BEGIN(INITIAL);
                        if ( yylval->string != NULL )
                            string_unescape_c(yylval->string);
                        return ( TOKEN_STRING );
                    }
}

Somehow the escape sequence \" is matched only at the beginning of a string. If \" appears later in the string, the characters \ and " appear to be matched separately.

For instance:

  1. Passes: "\" "

  2. Fails: " \" "

  3. Fails: "This is string example: \"a string inside of string\""

Why is the escape sequence \" not matched by the rule "\\\"" when it appears later in a string?


Solution

  • If the backslash is not the first character in the quoted string, it is consumed as the last character of a longer [^"\n]+ match, because lex always prefers the longest match. The "\\\"" rule never sees it. For example:

     "abc\"def"
      ^^^^       First token, longest match of [^"\n]+
          ^      Terminates quoted string
    

    So you need to exclude backslashes as well. But once you do that, you need to provide a pattern which does match backslash escapes, not just backslash-escaped quotes. So I'd suggest:

    <DOUBLEQUOTE>{
      \\?\n              { /* Handle newline */ }
      ([^"\\\n]|\\.)+    { expr_parser_append_string(PARSER,
                                                     &yylval->string,
                                                     yytext); }
      \"                 { BEGIN(INITIAL); ... }
    }
    

    Note: I added an optional backslash to the beginning of the first pattern, in order to handle the case where the backslash immediately precedes a newline character. The . in the second pattern (\\.) will not match a newline so otherwise backslash-newline wouldn't be recognized at all.
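
To make the longest-match argument concrete, here is a minimal sketch in plain C that imitates how the scanner picks rules inside the DOUBLEQUOTE state. It is a hand-written simulation, not flex output: the function names (match_original_text, match_fixed_text, closing_quote_index) are hypothetical, and the loop assumes only the rules quoted above, with a one-character fallback standing in for the default rule.

```c
#include <stddef.h>
#include <string.h>

/* Longest prefix of s matched by the original rule [^"\n]+ .
 * Note that a backslash is an ordinary character for this pattern. */
static size_t match_original_text(const char *s) {
    return strcspn(s, "\"\n");
}

/* Longest prefix of s matched by the suggested rule ([^"\\\n]|\\.)+ . */
static size_t match_fixed_text(const char *s) {
    size_t n = 0;
    for (;;) {
        if (s[n] == '\\' && s[n + 1] != '\0' && s[n + 1] != '\n')
            n += 2;   /* \\. : a backslash plus any character except newline */
        else if (s[n] != '\0' && s[n] != '"' && s[n] != '\\' && s[n] != '\n')
            n += 1;   /* [^"\\\n] : an ordinary character */
        else
            return n;
    }
}

/* src starts with the opening quote. Returns the index of the quote that
 * the scanner treats as the end of the string, or -1 if there is none.
 * use_fixed_rules == 0 simulates the rules from the question,
 * use_fixed_rules != 0 simulates the rules suggested in the answer. */
long closing_quote_index(const char *src, int use_fixed_rules) {
    size_t i = 1;                       /* skip the opening quote */
    while (src[i] != '\0') {
        const char *p = src + i;
        size_t len, other = 0;

        if (p[0] == '"')                /* only the closing-quote rule matches */
            return (long)i;

        if (use_fixed_rules) {
            len = match_fixed_text(p);
            if (p[0] == '\n')
                other = 1;              /* \\?\n without the backslash */
            else if (p[0] == '\\' && p[1] == '\n')
                other = 2;              /* \\?\n with the backslash */
        } else {
            len = match_original_text(p);
            if (p[0] == '\n')
                other = 1;              /* "\n" */
            else if (p[0] == '\\' && p[1] == '"')
                other = 2;              /* "\\\"" wins only at token start */
        }

        if (other > len)
            len = other;                /* the longest match wins */
        if (len == 0)
            len = 1;                    /* stand-in for the default rule */
        i += len;
    }
    return -1;
}
```

Under these assumptions, calling closing_quote_index on the C literal for "abc\"def" with use_fixed_rules set to 0 stops at index 5, the escaped quote, which is the behaviour shown in the diagram. With use_fixed_rules set to 1 it stops at index 9, the real closing quote.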