
How to skip blank spaces in Flex?


I'm using flex to generate a lexical analyzer for a college assignment. It should recognize integers, floats, variable names, and math operators. It should also ignore whitespace characters such as " ", "\n", "\t", and so on. To start, I'm only trying to match a single whitespace character or a run of several. My rule file is this:

%{
  #include<stdio.h>
%}
%%
[0-9]+ printf("inteiro:%s\n",yytext);
[0-9]+\.[0-9]+ printf("fracionário:%s\n",yytext);
[a-zA-Z][a-zA-Z0-9]* printf("variável:%s\n",yytext);
\+|\-|\*|\/|\*\* printf("operador:%s\n",yytext);
\(|\) printf("parênteses:%s\n",yytext);
[[:space:]]|[[:space:]]+;
%%

with the following input

12 + 413

it generates this output:

inteiro:12
operador:+
inteiro:413

I would like to know why the last line couldn't be something like:

[[:space:]]+;


Solution

  • The rule

        [[:space:]]|[[:space:]]+;
    

    is a bit odd.

    A flex rule consists of a pattern and an (optional) action, separated by whitespace. Since there is no space before the ;, it is part of the pattern, not the action. So that pattern matches either a single whitespace character ([[:space:]]) or (|) a sequence of one or more whitespace characters followed by a semicolon ([[:space:]]+;).

    Since there is no action in that rule, whatever the pattern matches is simply discarded. (That's a flex extension. Lex requires that an action be present, even if it does nothing.) In effect, you will ignore all whitespace (one character at a time), and you will also ignore a semicolon if it is immediately preceded by whitespace.

    What you probably intended was

        [[:space:]]+     /* no action */
    

    (It is useful to insert a comment to make the absence of action visible.)
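
    As a minimal sketch, the question's scanner with the whitespace rule corrected might look like the following. The `%option noyywrap` line and the `main` function are assumptions added here so that the file builds on its own; they are not part of the original rule file.

    ```lex
    %option noyywrap
    %{
      #include <stdio.h>
    %}
    %%
    [0-9]+                  printf("inteiro:%s\n", yytext);
    [0-9]+\.[0-9]+          printf("fracionário:%s\n", yytext);
    [a-zA-Z][a-zA-Z0-9]*    printf("variável:%s\n", yytext);
    \+|\-|\*|\/|\*\*        printf("operador:%s\n", yytext);
    \(|\)                   printf("parênteses:%s\n", yytext);
    [[:space:]]+            /* no action: a whole run of whitespace is discarded */
    %%
    /* Assumed driver: read standard input until end of file. */
    int main(void) {
        yylex();
        return 0;
    }
    ```

    Here the whitespace before the comment ends the pattern, so `;` is no longer part of it. A semicolon in the input would then fall through to flex's default rule, which echoes unmatched characters to the output. Assuming the file is saved as `scanner.l` (a hypothetical name), running `flex scanner.l` produces `lex.yy.c`, which can be compiled with any C compiler.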


    By the way, character classes are usually much more readable than a forest of leaning timber (that is, a bunch of backslash escapes). Also, flex lets you use double quotes to quote strings.

    So instead of

        \+|\-|\*|\/|\*\*    /* operador */
        \(|\)               /* parênteses */
    

    you could write:

        [-+*/]|"**"         /* operador */
        [()]                /* parênteses */
    

    In the first character class, it is important to put the - either at the beginning or the end of the list of characters so that it is not interpreted as defining a range of characters.

    And, instead of inserting your own debugging printf statements, consider using the -d (debug) option when building your scanner. That will print out complete debugging information for you, letting you see precisely what is being done by the scanner.
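
    The last two suggestions can be combined in one sketch: character classes and a quoted string in place of the backslash escapes, and debug tracing in place of the `printf` calls. It uses `%option debug`, which the flex manual documents as the in-file equivalent of `-d`. As an assumption for the sake of a self-contained file, `%option noyywrap` and a small `main` are added.

    ```lex
    %option noyywrap debug
    %%
    [0-9]+                  /* inteiro */
    [0-9]+\.[0-9]+          /* fracionário */
    [a-zA-Z][a-zA-Z0-9]*    /* variável */
    [-+*/]|"**"             /* operador: '-' comes first, so it is a literal */
    [()]                    /* parênteses */
    [[:space:]]+            /* no action */
    %%
    /* Assumed driver: read standard input until end of file. */
    int main(void) {
        yylex();
        return 0;
    }
    ```

    A scanner built this way reports each rule it accepts, together with the matched text, on stderr. The tracing can be switched off at run time by setting the global `yy_flex_debug` to 0.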