Parser vs. lexer and XML

I'm reading about compilers and parsers architecture now and I wonder about one thing... When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here and what would be the tokens?

I've read that tokens are like words prepared for parsing by the lexer. Although I don't have problem with finding tokens for languages line C, C++, Pascal etc., where there are keywords, names, literals and other word-like strings separated by whitespace, with XML i have a problem, because there aren't any words! It's only plain text interleaved with the markup (tags).

I thought to myself that it could be that these tags and plain text fragments are the tokens, something like that: [TXT][TAG][TAG][TXT][TAG][TXT][TAG][TAG][TXT].... It would be quite reasonable, since SGML doesn't care what's inside the markup delimiters < and > (well, it recognizes special processing instructions and definitions when it founds ? or ! as the next character; comments belong to that group too), and the SGML tokenizer could be a base for the XML/HTML/XHTML parser.

But then I realized that there can be < characters stuffed inside the markup as a part of other syntax: attribute values :-/ Even if it's not quite good idea to put < characters inside attribute values (it's better to use &lt; for that), many browsers and editors deal with that and treat these < as a part of the attribute value, not a tag delimiter.

It complicates things a bit, because I don't see a way to recognize markup like that by a simple Deterministic Finite Automaton (DFA) in the lexer. It looks like it requires a separate context for the automaton when it's inside the tag, and another context when it encounters an attribute value. This would need a stack of states/contexts I think, so DFA might not handle that. Am I right?

What's your view? Is it good to make tokens from tags (markup) and plain text?

Here: http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
is used some kind different technique: they treat < and > (and also </ and />) as separate tokens, and inside the tags they use GENERIC_ID as a token etc.They generally move most of the work to the parser. But they also have to change contexts for the tokenizer: they use different context in the plain text, and different in markup (but they forgot about attribute values context I think, because first occurence of > will end the tag in their lexer).

So what's the best approach for parsing SGML-like languages? Is the lexer really used there? If yes, what strings constitute the tokens?


  • Having built XML and HTML parsers, I have opinions.

    Lexemes in general should be recognizable language elements.

    For XML and HTML, these correspond basically to

    • TAGBEGIN, things of the form of <NAME
    • TAGEND, of the form of >
    • TAGCLOSE, of the form of </NAME>
    • TAGENDANDCLOSE of the form /> (XML only)
    • ATTRIBUTENAME, of the form of NAME
    • EQUALSIGN, being precisely =
    • ATTRIBUTEVALUE, being the value of the exact character string represented by an attribute, regardless of quotes (or even absence of quotes, for legacy HTML). If there are escaped character codes inside the attribute, those code should be converted to their actual character code.
    • CONTENT, which is the text between TAGENDs and TAGBEGINs. Like ATTRIBUTEVALUES, any escaped characters should be converted, so the CONTENT between <B>foo&lt;bar</B> is converted to the text foo<bar If you want to keep the entity invocations as seperate tokens, you could do that, producing streams of CONTENT and ENTITYINVOCATION tokens between TAGENDs and TAGSTARTs; depends on what your goal is.

    We can argue about whether you want to produce a token for HTML/XML comments or not. If you do, you do.

    If we ignore the complications of DTDs and Schemas for XML, that's all you really need.

    How the lexer produces these is more complicated; with XML and HTML, there's a lot of messiness having to do with escapes in the input stream, <[CDATA[ ... ]]> (if I have that right) which is just a funny kind of quote and vanishes when the CONTENT lexeme is produced. To handle all this, you need a pretty sophisticated lexer engine. And yes, as practical matter, you need different lexical states ("modes") to process different parts of the text. I pretty much have one major mode to process things inside <...>, and one major mode to process CONTENT.