c++ compiler-construction flex-lexer lexical-analysis

Matching `\` in Flex


I am trying to create a simple state machine in flex that ensures strings spanning multiple lines put a \ before each line break. Concretely:

"this is \
ok"

"this is not
ok"

The first one is valid. The second one is not.

I have the following state machine:

expectstring     BEGIN(expectstr);
<expectstr>[^\n]     {num_lines++;}
<expectstr>\         {flag = true;}
<expectstr>\n        {printf("%s\n", flag ? "True" : "False");}

But when I try to compile this state machine, flex tells me that the rule with \ can not be matched. Why is that?

I have looked at this but cannot figure it out.


Solution

  • In flex, the following pattern matches anything other than a newline:

    .
    

    You can also write that as

    [^\n]
    

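    but . is more usual.

    That is why flex complains about your second rule: [^\n] already matches every single character other than a newline, including a backslash, and when two rules match the same length flex picks the one that appears first. No later single-character rule in that state can ever win. Also note that an unquoted \ followed by a space is an escaped space, not a backslash.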

    In order to match a backslash you can write

    \\
    "\\"
    [\\]
    

    Again, the first would be the usual way.

    It's important to understand that [...] is a way of representing a set of characters, and that most regular expression operators are just ordinary characters inside the brackets. Similarly, "..." is a way of representing a sequence of characters, and most regular expression operators are just ordinary characters inside the quotes.

    Thus,

    • [a|b] matches one character if it is an a, a |, or a b
    • "a|b" matches the three-character sequence a | b
    • and|but matches either of the three-character sequences and or but.

    Since flex lets you match regular expressions, you really don't need to manually build a state machine. Just use an appropriate regular expression. For example, the following will match strings which start and end with ", in which \ may be used to escape itself as well as newlines, and in which newlines (other than escaped ones) are illegal. I think that's your goal.

    \"([^"\n\\]|\\(.|\n))*\"
    

    You should make sure you understand how it works; there are lots of good explanations of regular expressions on the internet (and even more bad ones, so try to find one written by someone who knows what they are talking about). Here's the summary:

    \"     A literal double-quote
    (...)* Any number of repetitions of:
      [^"\n\\]   Anything other than a double-quote, newline, or backslash
      |          Or
      \\         A literal backslash, followed by
      (...)      Grouping
        .          Anything other than a newline
        |          Or
        \n         A newline
    \"     A literal double-quote