
Multiple string literals in flex


I'm using flex to parse a whole bunch of stuff, but I hit a roadblock when I tried to detect two string literals on the same line.

My regex:

["].*["]

Here's what I mean:

"cats" < "dogs"

is being recognized as one long string

cats" < "dogs

Why is flex only considering the two outermost quotation marks, instead of making two separate strings? I'm certain that the problem lies in my regex, so what I'm essentially asking is:

How do I write a regex that, in this scenario, would recognize the tokens STRING, LESS, STRING as opposed to just STRING?


Solution

  • I suppose you're using a pattern like this:

    ["].*["]              { return STRING; }
    

    Or perhaps

    ["].*?["]             { return STRING; }
    

    The first one won't work because flex always takes the longest match, and the match using the last " is obviously longer. The second one would be correct in a regular expression library which implements non-greedy repetition, but flex does not; in flex, .*? is just an optional .* (which is to say, the ? is a no-op.)

    What you actually want is to match strings of characters other than quotes. So you can just say that:

    ["][^"]*["]           { return STRING; }
    

    Unlike ., [^"] will match a newline character. If you don't want multi-line strings, use [^"\n] instead.

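    Obviously, the above doesn't allow " to appear in strings, which sooner or later will be annoying. Two popular solutions to this problem are (C-style) to allow \ to "escape" the next character, as in "a \" in a string". The character class has to exclude \ as well; otherwise a \ could also match as an ordinary character, and a string ending in an escaped backslash (\\) could swallow its own closing quote:

    ["]([^"\\]|\\.)*["]   { return STRING; }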
    

    or (SQL-style) to require that internal " be doubled: ("a "" in a string")

    ["]([^"]|["]["])*["]  { return STRING; }
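To see how the patterns above behave on the line from the question, here is a rough Python sketch. It is not flex: it uses Python's `re` module as a stand-in and imitates flex's rule selection by trying every rule at the current position and keeping the longest match. The names `tokenize`, `GREEDY`, `NEGATED`, `C_STYLE` and `SQL_STYLE` are made up for this illustration, and the whitespace rule is an assumption about the rest of the scanner. Python's engine is a backtracking one, so this is only meant as a sanity check for these particular patterns.

```python
import re

# Candidate patterns for the STRING rule, written in Python `re` syntax.
GREEDY = r'".*"'                  # the pattern from the question
NEGATED = r'"[^"]*"'              # a run of non-quote characters
C_STYLE = r'"(?:[^"\\]|\\.)*"'    # backslash escapes the next character;
                                  # the class excludes the backslash itself
SQL_STYLE = r'"(?:[^"]|"")*"'     # an internal quote is written twice


def tokenize(text, string_pattern):
    """Split text into (token, lexeme) pairs.

    At each position every rule is tried and the longest match wins;
    ties go to the rule listed first, as in flex.
    """
    rules = [
        ("STRING", re.compile(string_pattern)),
        ("LESS", re.compile(r"<")),
        (None, re.compile(r"\s+")),  # whitespace: matched, then discarded
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in rules:
            m = regex.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_end == pos:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name is not None:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    line = '"cats" < "dogs"'
    print(tokenize(line, GREEDY))   # one long STRING
    print(tokenize(line, NEGATED))  # STRING, LESS, STRING
```

With the pattern from the question the whole line comes back as a single STRING, while the negated character class gives the three tokens STRING, LESS, STRING. The escaping variants can be tried the same way by passing `C_STYLE` or `SQL_STYLE`.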