Search code examples
parsinglexerbnfll-grammarrecursive-descent

Tokenize abstract terminals in LL grammar


I am currently writing a basic parser for an XML flavor. As an exercise, I am implementing an LL table-driven parser.

This is my example of BNF grammar:

%token name data string

%% /* LL(1) */

doc          :  elem

elem         :  "<" open_tag

open_tag     :  name attr close_tag

close_tag    :  ">" elem_or_data "</" name ">"
             |  "/>"
             ;

elem_or_data :  "<" open_tag elem_or_data
             |  data elem_or_data
             |  /* epsilon */
             ;

attr         :   name ":" string attr
             |  /* epsilon */
             ;

Is this grammar correct ?

Each terminal literal is between quotes. The abstract terminals are specified by %token.

I am coding an hand-written lexer to convert my input into a tokens list. How would I tokenize the abstract terminals ?


Solution

  • The classic approach would be to write a regular expression (or other recogniser) for each possible terminal.

    What you call "abstract" terminals, which are perfectly concrete, are actually terminals whose associated patterns recognise more than one possible input string. The string actually recognised (or some computed function of that string) should be passed to the parser as the semantic value of the token.

    Nominally, at each point in the input string, the tokeniser will run all recognisers and choose the one with the longest match. (This is the so-called "maximal munch" rule.) This can usually be optimised, particularly if all the patterns are regular expressions. (F)lex will do that optimisation for you, for example.

    A complication in your case is that the tokenisation of your language is context-dependent. In particular, when the target is elem_or_data, the only possible tokens are <, </ and "data". However, inside a tag, "data" is not possible, and "name" and "string" tags are possible (among others).

    It is also possible that the value of an attribute could have the same lexical form as the key (i.e. a name). In XML itself, the attribute value must be a quoted string and the use of an unquoted string will be flagged as an error, but there are certainly "XML-like" languages (such as HTML) in which attribute values without whitespace can be inserted unquoted.

    Since the lexical analysis depends on context, the lexical analyser must be passed (or have access to) an additional piece of information defining the lexical context. This is usually represented as a single enumeration value, which could be computed based on the last few tokens returned, or based on the FIRST set of the current parser stack.