I recently started using Lex, as a simple way to explain the problem I encoutered, supposing that I'm trying to realize a lexical analyser with Flex that print all the letters and also all the bigrams in a given text, that seems very easy and simple, but once I implemented it, I've realised that it shows bigrams first and only shows letters when they are single, example: for the following text
QQQZ ,JQR
The result is
Bigram QQ
Bigram QZ
Bigram JQ
Letter R
Done
This is my lex code
%{
%}
letter[A-Za-z]
Separ [ \t\n]
%%
{letter} {
printf(" Letter %c\n",yytext[0]);
}
{letter}{2} {
printf(" Bigram %s\n",yytext);
}
%%
main()
{ yylex();
printf("Done");
}
My question is How can realise the two analysis seperatly, knowing that my actual problem isn't as simple as this example
Lexical analysers divide the source text into separate tokens. If your problem looks like that, then (f)lex is an appropriate tool. If your problem does not look like that, then (f)lex is probably not the correct tool.
Doing two simultaneous analyses of text is not really a use case for (f)lex. One possibility would be to use two separate reentrant lexical analysers, arranging to feed them the same inputs. However, that will be a lot of work for a problem which could easily be solved in a few lines of C.
Since you say that your problem is different from the simple problem in your question, I did not bother to either write the simple C code or the rather more complicated code to generate and run two independent lexical analysers, since it is impossible to know whether either of those solutions is at all relevant.
If your problem really is matching two (or more) different lexemes from the same starting position, you could use one of two strategies, both quite ugly (IMHO):
I'm assuming the existence of handler functions:
void handle_letter(char ch);
void handle_bigram(char* s); /* Expects NUL-terminated string */
void handle_trigram(char* s); /* Expects NUL-terminated string */
For historical reasons, lex implements the REJECT
action, which causes the current match to be discarded. The idea was to let you process a match, and then reject it in order to process a shorter (or alternate) match. With flex, the use of REJECT
is highly discouraged because it is extremely inefficient and also prevents the lexer from resizing the input buffer, which arbitrarily limits the length of a recognisable token. However, in this particular use case it is quite simple:
[[:alpha:]][[:alpha:]][[:alpha:]] handle_trigram(yytext); REJECT;
[[:alpha:]][[:alpha:]] handle_bigram(yytext); REJECT;
[[:alpha:]] handle_letter(*yytext);
If you want to try this solution, I recommend using flex's debug facility (flex -d ...
) in order to see what is going on.
See debugging options and REJECT documentation.
The solution I would actually recommend, although the code is a bit clunkier, is to use yyless()
to reprocess part of the recognised token. This is quite a bit more efficient than REJECT
; yyless()
just changes a single pointer, so it has no impact on speed. Without REJECT
, we have to know all the lexeme handlers which will be needed, but that's not very difficult. A complication is the interface for handle_bigram
, which requires a NUL-terminated string. If your handler didn't impose this requirement, the code would be simpler.
[[:alpha:]][[:alpha:]][[:alpha:]] { handle_trigram(yytext);
char tmp = yytext[2];
yytext[2] = 0;
handle_bigram(yytext);
yytext[2] = tmp;
handle_letter(yytext[0]);
yyless(1);
}
[[:alpha:]][[:alpha:]] { handle_bigram(yytext);
handle_letter(yytext[0]);
yyless(1);
}
[[:alpha:]] handle_letter(*yytext);