Lexer for an XML document -- the regex for XML element data is hiding the regex for whitespace -- how to fix it?


I am creating a lexer for an XML document. Here is my XML document (the actual document is much more complex; this is a simplified one that shows the problem):

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="test.xsd"
          version="1.0">
    <message>Hello, world</message>
</Document>

I want the lexer to produce this:

DOCUMENT_START_TAG
ATTRIBUTE_NAME = version
ATTRIBUTE_VALUE = "1.0"
MESSAGE_START_TAG
ELEMENT_VALUE = Hello, world
MESSAGE_END_TAG
DOCUMENT_END_TAG

That is, I want the lexer to ignore the first line (XML declaration), the whitespace between elements, and the two namespace declarations.

But instead, the lexer produces this:

ELEMENT_VALUE =

DOCUMENT_START_TAG
ATTRIBUTE_NAME = version
ATTRIBUTE_VALUE = "1.0"
ELEMENT_VALUE =

MESSAGE_START_TAG
ELEMENT_VALUE = Hello, world
MESSAGE_END_TAG
ELEMENT_VALUE =

DOCUMENT_END_TAG

The lexer rule for whitespace is not firing. Instead, the rule for element value is firing. So I know what the problem is: the regex for element value is not correct. But I don't know what the correct regex is. Any help you could provide would be much appreciated.

At the bottom is my entire .l file. Here is an explanation of the rules in it:

The first line -- the XML declaration line -- is something that I want the lexer to simply discard. Here is the lexer rule for it:

"<?"[^?>]+"?>"

An XML declaration starts with <? and finishes with ?>, and the stuff in between is anything except ? and >.

I want the lexer to discard whitespace between XML elements. Here is the lexer rule for whitespace:

[ \t\n]+

That gobbles up spaces, tabs, and newlines.

I want the lexer to ignore the two namespace declarations. Here are the lexer rules for them:

[ \t\n]+xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"
[ \t\n]+xsi:noNamespaceSchemaLocation=\"test.xsd\"

Namespace declarations are always preceded by at least one whitespace character.

I want the lexer to return the token DOCUMENT_START_TAG for the <Document> element. The <Document> element has attributes bundled inside of it, so that requires some special care:

"<Document"[^>]*">"         { yyless(9); return(DOCUMENT_START_TAG); }

The <Document> element starts with <Document and then there is some stuff and then it ends with >. The action puts back everything following <Document and returns the token DOCUMENT_START_TAG.

I want the lexer to return DOCUMENT_END_TAG for </Document>. Here's the lexer rule:

"</Document>"               { return(DOCUMENT_END_TAG); }

Here are the lexer rules for the message start and end tags:

"<message>"                 { return(MESSAGE_START_TAG); }
"</message>"                { return(MESSAGE_END_TAG); }

An XML attribute has a name, equals sign, and value wrapped in quotes. Here is the lexer rule for the name:

[^ \t\n=]+/=[ \t\n]*\"      { return(ATTRIBUTE_NAME); }

The name contains no space, tab, newline, or equals sign. The lookahead (trailing context) operator requires that the name be followed by an equals sign, optional whitespace, and a quote.

The attribute value is the stuff within quotes:

\"[^"]*\"                   { return(ATTRIBUTE_VALUE); }

I don't want the attribute value to contain the quotes - how to remove them?

I want the lexer to return the value of elements (e.g., Hello, world). An element value doesn't contain < or >:

[^<>]+/<                    { return(ELEMENT_VALUE); }

I use lookahead to indicate that the value is always followed by <.

Here is my complete .l file:

%{
  enum yytokentype {
    DOCUMENT_START_TAG = 258,
    DOCUMENT_END_TAG = 259,
    MESSAGE_START_TAG = 260,
    MESSAGE_END_TAG = 261,
    ELEMENT_VALUE = 262,
    ATTRIBUTE_NAME = 263,
    ATTRIBUTE_VALUE = 264,
    JUNK = 265
  };
  int yyval;
%}
%%
"<?"[^?>]+"?>"
[ \t\n]+
">"
"="
[ \t\n]+xmlns:xsi=\"http:\/\/www\.w3\.org\/2001\/XMLSchema-instance\"
[ \t\n]+xsi:noNamespaceSchemaLocation=\"test.xsd\"
"<Document"[^>]*">"         { yyless(9); return(DOCUMENT_START_TAG); }
"</Document>"               { return(DOCUMENT_END_TAG); }
"<message>"                 { return(MESSAGE_START_TAG); }
"</message>"                { return(MESSAGE_END_TAG); }
[^ \t\n=]+/=[ \t\n]*\"      { return(ATTRIBUTE_NAME); }
\"[^"]*\"                   { return(ATTRIBUTE_VALUE); }
[^<>]+/<                    { return(ELEMENT_VALUE); }
.                           { return(JUNK);  }
%%

int yywrap(){ return 1;}
int main(int argc, char *argv[])
{
    yyin = fopen(argv[1], "r");
    int tok;
    while (tok = yylex()) {
       switch (tok){
          case 258:
             printf("DOCUMENT_START_TAG\n");
             break;
          case 259:
             printf("DOCUMENT_END_TAG\n");
             break;
          case 260:
             printf("MESSAGE_START_TAG\n");
             break;
          case 261:
             printf("MESSAGE_END_TAG\n");
             break;
          case 262:
             printf("ELEMENT_VALUE = %s\n", yytext);
             break;
          case 263:
             printf("ATTRIBUTE_NAME = %s\n", yytext);
             break;
          case 264:
             printf("ATTRIBUTE_VALUE = %s\n", yytext);
             break;
          case 265:
             printf("JUNK = %s\n", yytext);
             break;
          default:
             printf(" = invalid token, value = %s\n", yytext);
       }
    }
    
    fclose(yyin);
    
    return 0;
}

Solution

  • Your rule for element value always beats your whitespace rule because it produces a longer match. Flex picks the rule with the longest match, and trailing context counts toward that length, even though the lexer backs up over the trailing context before running the action. For whitespace followed by <, the pattern [^<>]+/< therefore scores one character more than [ \t\n]+.

    That's documented in the Flex manual, but it's easy to miss.

    It's not clear to me why you feel the need for trailing context. The only characters which could follow [^<>]+ are < and >; if you want to treat > as an error, it would make more sense to flag the error at the point where the > occurs than to flag it at the beginning of the element value which eventually contains the offending character. But it probably makes even more sense to just quietly accept > as an ordinary character. Either way, trailing context is not required, and without that trailing context your whitespace rule will win, where applicable.
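
    The length comparison described above can be sketched in plain C. This is only a model of how the two competing rules are ranked, not Flex's actual implementation; the names whitespace_len, value_len, and winner are made up for illustration, and the other rules in the .l file are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum rule { NO_MATCH, WHITESPACE, ELEMENT_VALUE };

/* Length matched by [ \t\n]+ at the start of s (0 means no match). */
size_t whitespace_len(const char *s)
{
    assert(s != NULL);
    return strspn(s, " \t\n");
}

/* Length used to rank the element-value rule (0 means no match).
 * With trailing_context set, this models [^<>]+/< : the body must be
 * followed by '<', and that '<' counts toward the ranked length.
 * With it clear, this models plain [^<>]+ . */
size_t value_len(const char *s, int trailing_context)
{
    assert(s != NULL);
    size_t body = strcspn(s, "<>");
    if (body == 0)
        return 0;
    if (!trailing_context)
        return body;
    return s[body] == '<' ? body + 1 : 0;
}

/* Longest match wins; on a tie the rule listed first wins, which in
 * the question's file is the whitespace rule. */
enum rule winner(const char *s, int trailing_context)
{
    size_t ws = whitespace_len(s);
    size_t val = value_len(s, trailing_context);
    if (val > ws)
        return ELEMENT_VALUE;
    if (ws > 0)
        return WHITESPACE;
    return NO_MATCH;
}
```

    With this model, winner("\n    <message>", 1) picks ELEMENT_VALUE because the lookahead character makes that match one longer, while winner("\n    <message>", 0) is a tie that goes to WHITESPACE.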

    Note that if the XML document uses CRLF line endings, the whitespace rule won't match the carriage returns. I always suggest using [[:space:]] instead of listing the whitespace characters, although it also matches some characters (such as vertical tab and form feed) which could be considered errors.
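
    To make the CRLF point concrete, here is a small C sketch comparing the hand-listed class with isspace(), which is what [[:space:]] corresponds to in the default C locale as far as I know. The function names are hypothetical and exist only for this comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of leading characters matched by the hand-listed class [ \t\n]. */
size_t listed_ws_prefix(const char *s)
{
    assert(s != NULL);
    return strspn(s, " \t\n");
}

/* Number of leading characters accepted by isspace(): space, \t, \n,
 * \r, \v, and \f in the "C" locale. */
size_t space_class_prefix(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0' && isspace((unsigned char)s[n]))
        n++;
    return n;
}
```

    For the input "\r\n  <a>", listed_ws_prefix stops immediately at the carriage return, while space_class_prefix consumes all four whitespace characters. The same function also accepts a form feed, which is the kind of character XML does not treat as whitespace.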

    Similarly, scanning a tag up to the closing > and then backing up to the tag name is pointless. Either the tag is correctly terminated and you will eventually reach the >, or you will hit the end of the document, at which point you can throw an error. What you should do, however, is catch tags whose names merely start with Document, such as <Documentary> (which your current pattern will accept). That would suggest something like:

    "<Document"         { return DOCUMENT_START_TAG; }
    "<message"          { return MESSAGE_START_TAG; }
    "</Document"        { return DOCUMENT_END_TAG; }
    "</message"         { return MESSAGE_END_TAG; }
    "</"[^[:space:]>]+  { return UNKNOWN_END_TAG; }
    "<"[^[:space:]>]+   { return UNKNOWN_START_TAG; }
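
    The way these six rules interact (longest match first, earlier rule on a tie) can be modelled in plain C. This is a rough sketch, not generated scanner code; classify_tag, name_is, and the enum are hypothetical names, and the input is assumed to start at a candidate tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum tag_kind {
    NOT_A_TAG, DOCUMENT_START, MESSAGE_START,
    DOCUMENT_END, MESSAGE_END, UNKNOWN_END, UNKNOWN_START
};

/* True if the len-byte name equals the given literal exactly. */
int name_is(const char *name, size_t len, const char *literal)
{
    return len == strlen(literal) && memcmp(name, literal, len) == 0;
}

enum tag_kind classify_tag(const char *s)
{
    assert(s != NULL);
    if (s[0] != '<')
        return NOT_A_TAG;
    /* What [^[:space:]>]+ would consume after the '<'. */
    size_t run = strcspn(s + 1, " \t\n\r\v\f>");
    if (run == 0)
        return NOT_A_TAG;
    if (s[1] == '/') {
        const char *name = s + 2;
        size_t len = run - 1;
        if (len == 0)            /* a bare "</" only fits the last rule */
            return UNKNOWN_START;
        if (name_is(name, len, "Document"))
            return DOCUMENT_END;
        if (name_is(name, len, "message"))
            return MESSAGE_END;
        return UNKNOWN_END;
    }
    /* A longer name such as "Documentary" makes the catch-all rule the
     * longest match, so only an exact name reaches the specific rules. */
    if (name_is(s + 1, run, "Document"))
        return DOCUMENT_START;
    if (name_is(s + 1, run, "message"))
        return MESSAGE_START;
    return UNKNOWN_START;
}
```

    Under this model, "<Document version=..." is classified as DOCUMENT_START, while "<Documentary>" falls through to UNKNOWN_START because the catch-all pattern matches more characters than the literal.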