Extracting data between tags using Flex-Bison

Being a starter at Flex-Bison I have hit the first roadblock and can't seem to find a way through.

Problem statement: For a given html/xml file it is required to extract data between tags. I have read related questions on SO but don't seem to hit the sweet spot of this problem though

(As this is meant to learn how to use flex-bison, I would not like to switch to using any other language/tool).

The input file contains the following field to be extracted:

<!DOCTYPE html>
<html charset="utf-8" lang="en">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<meta content="text/css" http-equiv="Content-Style-Type">
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.css"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>


<!--<legend> DATA TO BE EXTRACTED</legend>--> //relevant data between <legend> tag

I wrote the following scanner test.l

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
"<\/legend>" {return name3;}
%%

and the parser code parser.y

%{
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#define YYERROR_VERBOSE
extern int yylex();
extern int yyparse();
extern FILE *yyin;

%}

%union {
    char *sval;
}

%token <sval> name1
%token <sval> name2 
%token <sval> name3

%%
names : name1 name2 name3 { printf("%s\n", $2); }

%%

int main(int argc, char **argv) {

    // open a file handle to a particular file:
    FILE *myfile = fopen(argv[1], "r");
    // make sure it is valid:
    if (!myfile) {
        printf("I can't open file!");
        return 1;
    }
    // set flex to read from it instead of defaulting to STDIN:
    yyin = myfile;

    // parse through the input until there is no more:
    do {
        yyparse();
    } while (!feof(yyin));

}

void yyerror(char *s) {
    printf("EEK, parse error!  Message:%s",s);
    // might as well halt now:
    exit(1);
}

Compilation using the makefile

all: compile_run

compile_run:
    @bison -d parser.y
    @flex test.l
    @gcc parser.tab.c lex.yy.c -lfl -o run

On executing the program however I get the following error:

EEK, parse error! Message:**syntax error, unexpected name2, expecting name1***

I understand reading the error that as name2 token can match infinitely and also that it appears before expected token name1 from the grammmar.

My question is, now that I have defined the grammar to first look for name1 then name2 and then name3 token why should there be this error.

If I define only one token name1 in the scanner as

<!--<legend>(.*?)<\/legend> {return name1;}

I will get the entire string including the tags. which I can post process to get the data out, but I really think there has to be a smarter way around, which I'll get to know from here :).

Solution

The reason you are having a problem is that you have only defined rules for part of the input file and are hoping that the lexer and parser will just ignore the rest. That is not how the tools work; they try and match everything, so you have to define everything for every aspect of the input data.

I also note that your original lexer file did not build with flex. You have the rules in the wrong order. Your original rule set of:

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
"<\/legend>" {return name3;}
%%

Give the following error:

"test.l", line 8: warning, rule cannot be matched

This is because flex uses the rules in order, and name3 could never be returned as the pattern for name2 would also match name3. You obviously fixed this to be able to build your test program. The fix is to reverse the order of the rules, like this:

%option noyywrap
%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"  {return name1;}
"<\/legend>" {return name3;}
(.*?)   {yylval.sval=strdup(yytext); return name2;}
%%

One feature of flex (bison) that is useful in debugging is the debug mode, which is not surprising!

If we run your code with the debug mode enabled, like this:

bison -d parser.y
flex -d test.l
gcc  parser.tab.c lex.yy.c -lfl -o run

and then execute the program, we now get helpful output from the lexer:

--(end of buffer or a NUL)
--accepting rule at line 8 ("")
EEK, parse error! Message:syntax error, unexpected name2, expecting name1

You can see that your rule (.*?) really does match any text, but not just within the <legend> but everywhere else as well. This means that your parser will see a sequence of tokens name2,name2,name2 before it ever sees the name1. Now your only rule in the parser says in input must start with a name1 token, and thus you get a syntax error.

Now, there are several ways of solving this problem. You can change your bison rules to accept lots of name2 tokens before a name1, or you could upgrade the whole grammar to describe the whole of XML/HTML. At least you might want to upgrade the grammar to accept several <legend> tags in one file! At the moment your grammar only matches a file containing one single <legend> structure and nothing else - remember it does not just ignore the other input (unless you tell it to)!

It would be bigger job to rewrite the grammar for a generalised XML structure, but what could be done is to instruct the flex lexer to ignore the other input so that the name2 pattern is not returned. We just have to write patterns for the other things in the input data file. We need to match other XML tags, the comment line and the white space and tell flex to ignore them.

An example of doing this might be:

%{
#include "parser.tab.h"
%}
%%
"<!--<legend>"           {return name1;}
"<\/legend>"             {return name3;}
"<".[^-](.|[ \t])*">"    ; /* Skip other tags */
"//".*[\r\n]+            ; /* Skip comments */
[\r\n\t ]+               ; /* Skip unused whitespace */
(.*?)                    {yylval.sval=strdup(yytext); return name2;}
%%

When we run this code we do manage to skip some of the unwanted tags:

--(end of buffer or a NUL)
--accepting rule at line 7 ("<!DOCTYPE html>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<html charset="utf-8" lang="en">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<head>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<meta content="text/html; charset=UTF-8" http-equiv
="content-type">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<meta content="text/css" http-equiv="Content-Style-
Type">")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<script src="/commd/jquery.nivo.slider.pack.js"></s
cript>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<link rel="stylesheet" type="text/css" href="/fonts
/stylesheet.css"/>")
--accepting rule at line 9 ("
")
--accepting rule at line 7 ("<link rel="stylesheet" type="text/css" href="/commd
/stylesheet.css"/>")
--accepting rule at line 9 ("


")
--accepting rule at line 10 ("<!--<legend> DATA TO BE EXTRACTED</legend>--> //re
levant data between <legend> tag")
EEK, parse error!  Message:syntax error, unexpected name2, expecting name1

We hit another problem. The mystery is, why did it not match and return name1? This is cause by the matching algorithm being greedy and finding the rule that matches the longest token. To overcome this we have to use the start condition feature of flex, and only match general text when inside the <legend> structure. When using start conditions in match XML we have to be careful as the < symbol is used to denote the state change as well as introducing XML tags. We can recode to switch states like this:

%{
#include "parser.tab.h"
%}
%x legends 
%x finishd
%%
<INITIAL>"<!--<legend>"         {BEGIN(legends); return name1;}
<finishd>"</legend>-->"         {BEGIN(INITIAL); return name3;}
<INITIAL>"<".[^-](.|[ \t])*">"  ; /* Skip other tags */
<INITIAL>"//".*[\r\n]+          ; /* Skip comments */
<INITIAL>[\r\n\t ]+             ; /* Skip unused whitespace */
<legends>[^<>]+                 {BEGIN(finishd); yylval.sval=strdup(yytext); return name2;}
%%

Then magically we get the following:

--(end of buffer or a NUL)
--accepting rule at line 9 ("<!DOCTYPE html>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<html charset="utf-8" lang="en">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<head>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<meta content="text/html; charset=UTF-8" http-equiv
="content-type">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<meta content="text/css" http-equiv="Content-Style-
Type">")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<script src="/commd/jquery.nivo.slider.pack.js"></s
cript>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<link rel="stylesheet" type="text/css" href="/fonts
/stylesheet.css"/>")
--accepting rule at line 11 ("
")
--accepting rule at line 9 ("<link rel="stylesheet" type="text/css" href="/commd
/stylesheet.css"/>")
--accepting rule at line 11 ("


")
--accepting rule at line 7 ("<!--<legend>")
--accepting rule at line 12 (" DATA TO BE EXTRACTED")
--accepting rule at line 8 ("</legend>-->")
 DATA TO BE EXTRACTED
--accepting rule at line 11 (" ")
--(end of buffer or a NUL)
--accepting rule at line 10 ("//relevant data between <legend> tag
")
--(end of buffer or a NUL)
--EOF (start condition 0

Now, if you turn off the flex debugging, you just get the desired output:

DATA TO BE EXTRACTED

You'd still need to update the bison grammar if there was more than one set of data to be extracted; actually you should probably upgrade the whole bison grammar to better match more XML. At least I have explained, tutorial fashion, what was happening and one way of making it work with your sample dataset.