I'm a real lex rookie. I'm trying to use a regular expression to identify the string in a printf() call
such as printf("hello world!");
but the best result I can get is "hello world!", and I don't want the double quotation marks, just hello world! How should I do that?
The regex so far is: ("\"")(.)*("\"")
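Good regular expressions to match string literals are:
["]([^"\\]|\\(.|\n))*["]
["]([^"\\\n]|\\(.|\n))*["]
The first one accepts string literals that span several lines; the second one rejects an unescaped newline inside a literal, so only a backslash-escaped newline can continue a string onto the next line. In both patterns the character class excludes the backslash, so a backslash can only be matched together with the character it escapes; otherwise the scanner could treat the backslash as an ordinary character and misjudge where the literal ends. In both cases, an unmatched quotation mark won't be matched, so you'll need to deal with those erroneous inputs with some other pattern. Both patterns accept backslash escapes (including backslash-escaped newlines) without making any attempt to interpret them. Most real-life lexical scanners will want to process backslash escape sequences in some fashion, often by turning them into the characters they represent. But that requires a different mechanism, which is out of scope for this question.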
As you have discovered, the match includes the quotation marks, so you will want to remove them. Since you normally must make a copy of the matched token (because the contents of yytext will be overwritten the next time the scanner is called), that can easily be done by simply copying the part of the match you are interested in.
Remember that yyleng is the length of the token. Consequently, the substring you want starts at yytext + 1 (to skip over the opening quote) and continues for yyleng - 2 characters (to not include either quote):
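["]([^"\\]|\\(.|\n))*["] {
    /* [^"\\] keeps a backslash from being matched as an ordinary character. */
    /* Room for the yyleng - 2 characters between the quotes, plus the NUL. */
    yylval.str = malloc(yyleng - 1);
    memcpy(yylval.str, yytext + 1, yyleng - 2);  /* skip the opening quote */
    yylval.str[yyleng - 2] = 0;                  /* terminate the copy */
}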
There are other ways to write that, of course, but they will all be similar.
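One such variation, as a minimal sketch: the same arithmetic can be moved into a small C helper so that the rule action stays short and the allocation is checked. The name copy_unquoted is hypothetical (it is not part of lex or flex), and the sketch assumes the matched token always begins and ends with a quote character, which the patterns above guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not provided by lex/flex.
 * Returns a freshly allocated, NUL-terminated copy of a quoted token with
 * its first and last characters dropped. `text` and `len` play the roles
 * of yytext and yyleng. Returns NULL if the token is too short to hold
 * both quotes or if malloc fails. The caller must free() the result. */
char *copy_unquoted(const char *text, size_t len)
{
    assert(text != NULL);
    if (len < 2)
        return NULL;                  /* no room for both quotes */

    size_t inner = len - 2;           /* characters between the quotes */
    char *copy = malloc(inner + 1);   /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;

    memcpy(copy, text + 1, inner);    /* start after the opening quote */
    copy[inner] = '\0';
    return copy;
}
```

With that helper placed in the definitions section of the scanner, the action could be reduced to yylval.str = copy_unquoted(yytext, yyleng); followed by whatever error handling you prefer when it returns NULL.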