I'm trying to create a scanner with flex that acts somewhat like grep
.
Basically, what I want to do is: given a word (regular text, not a regex), find any line in the input that contains a match for that text, then print the line that contains the word.
The problem I've been having is that I can't figure out how to best print the line. I can print everything after the searched word, but I don't know how to properly store the contents of the whole line.
I tried using yyseek()
, but when I compile, I get back the message that yyseek
is an undefined symbol.
Using yymore()
to store text works well for anything after the matched word in the line.
Here is the code that I have so far:
%option yylineno
%option noyywrap
%{
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *search_str = NULL;
char *curr_line = NULL;
%}
%x found
letter [a-zA-Z]
word {letter}+
line (.*)\n
%%
<INITIAL,found>{word} {
/* If a word matches the string that we are looking for, use the 'found'
* condition, which will cause the line to be dumped at the end.
*/
yymore();
if (strcmp(search_str, yytext) == 0) {
BEGIN(found);
}
}
<found>{line} {
yymore();
ECHO;
BEGIN(INITIAL);
}
. { }
\n {}
%%
int main(int argc, char *argv[])
{
if (argc > 1) {
unsigned int str_len = sizeof(argv[1]);
search_str = malloc(str_len + 1);
strcpy(search_str, argv[1]);
yylex();
free(search_str);
return 0;
}
printf("usage: ./a.out [search word]\n");
return 1;
}
This is really not a good use case for flex. And it's not totally clear to me that it will do what you want, either. (Since I don't actually know what you want, so I could be wrong about that. But note the following:
Target line grep night grep -w night Your code
------------------- ---------- ------------- ---------
a night to remember Yes Yes Yes
a knight to forget Yes No No
night23 Yes No Yes
Anyway, your instinct about using yymore
was correct. You just have to start earlier, so that the entire line is retained in token. The small complication is that when you need to check a word, you can't check from the beginning of yytext
; it contains the whole line up to this point. You have to check the last strlen(search_str)
characters. The following code makes sure it only does that computation once, since it requires a complete scan of search_str
. Also note that it makes sure it does not overrun the beginning of yytext
.
In effect, the following code divides the text into three kinds of tokens: words, non-words, and newlines. Only newline fails to call yymore()
, so when the newline rule triggers, yytext
contains the entire line. As in your code, once a match is found in a line, the rest of the line is simply added to the match.
(Note: I rewrote this without macros, which are generally overused. I don't see any reason to think that {letter}
is more readable than [[:alpha:]]
, and the latter has the advantage of being clear to anyone who knows flex, without having to search for your particular definition.)
%x FOUND
%%
/* Indented lines before the first rule are put at the top of yylex */
int match_length = strlen(search_str);
[^[:alpha:]\n]+ { yymore(); }
[[:alpha:]]+ { yymore();
if (yyleng >= match_length
&& 0 == strcmp(yytext + yyleng - match_length,
search_str))
BEGIN(FOUND);
}
<INITIAL,FOUND>\n BEGIN(INITIAL);
<FOUND>.* printf("%s\n", yytext);
The oddity at the end is to deal with inputs which are not correctly terminated with a newline character. The last pattern will print the line with a newline character (even if there isn't one), and the newline character (if there is one) will restart the start condition.
For a slight speed gain, you could remember the previous value of yyleng
every time you call yymore()
, so that yyleng - prev_yyleng
will be the length of "this part" of the token. (The flex scanner knows this value but doesn't provide any interface for you to find it out, which is slightly annoying. But it's not a big deal.) Then instead of checking whether the entire line up to this point is long enough to make a compare possible, you could check whether the last word matched was exactly the right length, which will be true less often, thereby requiring fewer calls to strcmp
.
All in all, though, this is not a good strategy. You'll probably find that strstr
is faster than flex, and it's only slightly optimised compared to what is possible for a repeated search for the same target. Better would be to implement or find one of the standard search algorithms: