Before I show what I have done, here is the assignment I have tried to do (I'm new, so I'm not really sure whether I'm doing it right).
1. Implement lexical analyzer (using FLEX), as follows:
- Lexical analyzer supplies services next_token(), back_token()
- Lexical analyzer reads text from the input file and identifies tokens. This
happens when function next_token() is called.
- When a token is identified in the input text, it should be stored in a data
structure. For each token, the following attributes are saved:
* token type
* token lexeme
* number of the line in the input text in which this token was found.
- Blanks, tabs, new lines – are not tokens, and should be ignored
- For each token, print (on a separate line) its type (e.g. rel_op , number , etc.)
and lexeme
- Each operation, keyword, separation sign and each type of number should be
implemented as a token of a different kind
- Kinds of tokens are coded with integer numbers, for example:
# define ID_tok 1
# define COMMA_tok 2
Using Flex, I have written this:
%{
#include <stdio.h>
int line = 1;
#define ID_tok       1
#define COMMA_tok    2
#define REL_OP_tok   3
#define NUMBER_tok   4
#define KEYWORDS_tok 5
%}
binary_ar_op "*"|"/"|"+"|"-"
rel_op "=="|"!="|">"|"<"|"<="|">="
id [a-z][a-z0-9]*
number [+-]?[0-9]+("."[0-9]+)?
keywords "prabegin"|"parend"|"task"|"begin"|"end"|"integer"|"real"|"do"|"until"|"od"|"send"|"accept"
%%
\n              { line++; }
[ \t]+          { /* blanks and tabs are not tokens; ignore them */ }
{binary_ar_op}  { printf("binary_ar_op: %s at line %d\n", yytext, line); }
{rel_op}        { printf("rel_op: %s at line %d\n", yytext, line); }
{keywords}      { printf("keyword: %s at line %d\n", yytext, line); }
{id}            { printf("id: %s at line %d\n", yytext, line); }
{number}        { printf("number: %s at line %d\n", yytext, line); }
%%
int yywrap()
{
    return 1;
}
int main(void)
{
    printf("Enter a string of data\n");
    yylex();
    return 0;
}
As you can see, I already defined all the types I was supposed to, but I do not understand how to implement next_token() and back_token(), so some guidance would be great. I also saved the line number, but I guess they mean something other than what I wrote, and I don't understand the #define part they wrote about either.
I know this question seems odd, but I got this assignment without any guidance or explanation beforehand, so I'm learning it on my own and I don't really understand it all (although I know the theory; thank you!).
I did something very similar in our company project(s).
About tokens
I make enumerations for them and...
About next_token()
My intention was to store the whole of the token-related information (its kind, lexeme, line number, etc.) in one object.
Additionally, I wanted to manage these generated objects with smart pointers; not to mention, they should be C++ objects.
This is what I realized:
It is easy to redefine the yylex()
function. Thus, you can even rename it and change its signature.
It is very difficult (if not impossible) to put this together with yacc/bison. The main issue is that data is passed from the lex-generated code to the yacc/bison-generated code using a C union (%union,
if I remember right). A C union and C++ objects don't mix well. (One object in a C union may work, but multiple definitely won't.)
Luckily for me, the 2nd issue did not actually exist, because I use flex but write (meanwhile generate) recursive-descent parsers directly in C++.
So, how to solve the 1st issue? This is from my code:
/// redefines proto for generated yylex function.
#define YY_DECL \
RF::YAMS::Sim::ActionScript::RefToken \
RF::YAMS::Sim::ActionScript::Compiler::lex(yyscan_t yyscanner)
The flex manual is where you find the documentation for this; to see how to redefine the yylex function, search it for "YY_DECL".
My parser calls lex()
whenever it needs a new token.
Notes:
In my case, I renamed yylex()
and even made it a method of my parser class. (I did this to simplify the lexer's access to private parser variables.)
I provided full scope qualification because the generated lex code does not consider any namespace I use in my C++ code.
The yyscan_t yyscanner
parameter has to be there because I generate re-entrant scanners. You have to decide whether or not it should be there. (Instead you could provide other arguments also.)
The returned RefToken
is a smart pointer to the produced token. (Smart pointers make it very easy to produce and consume tokens in different "places" without the danger of memory leaks.)
If the generated lexer is to be combined with a bison-generated parser, it is probably not as easy. In that case, I would use static variables and organize queues to pass values from the lexer to the parser. That would work but is, of course, not as elegant and safe as the method above.
About back_token()
Once you have a parser which consumes the tokens, you may do with them whatever you want. In my case, one of the requirements was an option for backtracking, so from time to time I have to push tokens back to the input. For this, I simply stack them in the parser. When a new token is required, I first check whether this stack is empty: if not, the topmost token is popped; otherwise the lex()
method is called to obtain a really new token. I guess a similar solution could be used to implement back_token()
in your case.
About blanks
There are actually two types of rules (i.e. rule actions) in my lexer:
actions which end up with return new Token(...);
actions which end up with break;
The latter I use to consume separators (i.e. blanks etc.) and even comments (the parser never sees them). This works because the lexer is actually nothing else than a switch()
wrapped in a for()
loop. (I learned this "trick" from the flex documentation, where it is mentioned explicitly somewhere.)
What else...
Besides YY_DECL
, I also redefined YY_INPUT
. I did this to use the lexer with a C++ std::istream
(instead of yyin
).
IMHO flex provides a very comprehensive manual. However, whenever I'm in doubt, I look into the C file generated by flex. There are these horrible int
arrays for the finite automaton, which I usually simply ignore. The rest is the infrastructure around them, and you will find your C actions (written in the lex rules) embedded somewhere in there. Examining the surrounding code may make things clearer.