
Should lexer distinguish different types of string tokens?


I'm writing a Jade-like language that transpiles to HTML. Here's what a tag definition looks like:

section #mainWrapper .container

This transpiles to:

<section id="mainWrapper" class="container">

Should the lexer tell class and id apart or should it only spit out the special characters with names?

In other words, should the token array look like this:

[
    {type: 'tag', value: 'section'},
    {type: 'id', value: 'mainWrapper'},
    {type: 'class', value: 'container'}
]

with the parser just assembling these into a tree?

Or should the lexer be very primitive and only return matched strings, leaving the parser to distinguish them:

[
    {type: 'name', value: 'section'},
    {type: 'name', value: '#mainWrapper'},
    {type: 'name', value: '.container'}
]

Solution

  • As a rule of thumb, tokenisers shouldn't parse and parsers shouldn't tokenise.

    In this concrete case, it seems to me unlikely that every unadorned use of a name-like token -- such as section -- would necessarily be a tag. It's more likely that section is a tag because of its syntactic context. If the tokeniser attempts to mark it as a tag, then the tokeniser is tracking syntactic context, which means that it is parsing.
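As a minimal sketch of a tokeniser that stays out of the parser's business, the function below classifies input by lexical form only. The token types `name` and `sigil`, the function name `tokenize`, and the rule for what counts as a name (a letter followed by word characters or hyphens) are all assumptions made for illustration, not part of the question's language. It emits the sigils as single-character tokens, which is only one of the possible choices.

```javascript
// Hypothetical token types: 'name' and 'sigil'.
// Nothing in here knows what a tag, an id, or a class is.
function tokenize(line) {
  const tokens = [];
  // Sticky regex: each match must begin exactly at lastIndex.
  // Assumed lexical form of a name: a letter, then word characters or hyphens.
  const pattern = /\s*(?:([A-Za-z][\w-]*)|([#.]))/y;
  let pos = 0;
  while (pos < line.length) {
    pattern.lastIndex = pos;
    const match = pattern.exec(line);
    if (match === null) {
      if (line.slice(pos).trim() === '') break; // only trailing whitespace left
      throw new SyntaxError(`Unexpected input near column ${pos + 1}`);
    }
    if (match[1] !== undefined) {
      tokens.push({ type: 'name', value: match[1] });
    } else {
      tokens.push({ type: 'sigil', value: match[2] });
    }
    pos = pattern.lastIndex;
  }
  return tokens;
}
```

With this sketch, `tokenize('section #mainWrapper .container')` yields five tokens: a name, a sigil, a name, a sigil, and a name. Whether `section` is a tag is left entirely to the parser.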

    The sigils . and # are less clear-cut. You could consider them single-character tokens (which the syntax will insist be followed by a name) or you might consider them to be the first character of a special type of string. Some things that might sway you one way or the other:

    • Can the sigil be separated from the following name by whitespace? (# mainWrapper). If so, the sigil is probably a token.

    • Is the lexical form of a class or id different from a name? Think about the use of special characters, for example. If you can't accurately recognise the object without knowing what sigil (if any) preceded it, then it might better be considered as a single token.

    • Are there other ways to represent class names? For example, how do you represent multiple classes? Some possibilities off the top of my head:

      .classA .classB
      .(classA classB)
      ."classA classB"
      class = "classA classB"

      If any of the options other than the first one is valid, you should probably just make . a token. But correct handling of the quoted strings might create other challenges. In particular, it could require retokenising the contents of the string literal, which would violate the heuristic that parsers shouldn't tokenise. Fortunately, these aren't absolute rules; retokenisation is sometimes necessary. But keep it to a minimum.
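If the sigils are treated as separate tokens, the decisions about what is a tag, an id, or a class move into the parser, which makes them from position alone. The sketch below assumes a hypothetical token shape `{type: 'name' | 'sigil', value}` and a hypothetical `parseTagLine` function; the rule that a line may carry at most one id is also an assumption, borrowed from HTML.

```javascript
// Assumed token shape: { type: 'name' | 'sigil', value: string }.
function parseTagLine(tokens) {
  let i = 0;
  const first = tokens[i++];
  // The first name on the line is the tag purely because of where it sits.
  if (!first || first.type !== 'name') {
    throw new SyntaxError('Expected a tag name at the start of the line');
  }
  const node = { tag: first.value, id: null, classes: [] };
  while (i < tokens.length) {
    const sigil = tokens[i++];
    const name = tokens[i++];
    if (sigil.type !== 'sigil') {
      throw new SyntaxError(`Expected '#' or '.', got '${sigil.value}'`);
    }
    // The syntax insists that every sigil is followed by a name.
    if (!name || name.type !== 'name') {
      throw new SyntaxError(`Expected a name after '${sigil.value}'`);
    }
    if (sigil.value === '#') {
      // Assumption: at most one id per tag, as in HTML.
      if (node.id !== null) throw new SyntaxError('Duplicate id');
      node.id = name.value;
    } else {
      node.classes.push(name.value);
    }
  }
  return node;
}

const node = parseTagLine([
  { type: 'name', value: 'section' },
  { type: 'sigil', value: '#' },
  { type: 'name', value: 'mainWrapper' },
  { type: 'sigil', value: '.' },
  { type: 'name', value: 'container' },
]);
console.log(node); // tag 'section', id 'mainWrapper', classes ['container']
```

Because whitespace never reaches this parser, `# mainWrapper` and `#mainWrapper` are indistinguishable to it. If that is not what the language should allow, it is a hint that the sigil and name belong together in one token.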

    The separation into lexical and syntactic analysis should not be a strait-jacket. It's a code organization technique intended to make the individual parts easier to write, understand, debug and document. It is often (but not always) the case that the separation makes it easier for users of your language to understand the syntax, which is also important. But it is not appropriate for every parsing task, and the precise boundary is flexible (but not porous: you can put the boundary where it is most convenient but once it's placed, don't try to shove things through the cracks.)

    If you find this separation of concerns too difficult for your project, you should either reconsider your language design or try scannerless parsing.
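For comparison, here is a rough sketch of what a scannerless approach could look like for the single-line example from the question: the function reads characters directly and never produces a token array. The name `transpileTagLine`, the definition of a name, and the choice to require the sigil to touch its name are assumptions for illustration only; a real scannerless parser would usually be driven by a character-level grammar rather than two hand-written regular expressions.

```javascript
// Hypothetical scannerless sketch: syntax is recognised straight from characters.
function transpileTagLine(line) {
  const head = /^\s*([A-Za-z][\w-]*)/.exec(line);
  if (head === null) throw new SyntaxError('Expected a tag name');
  let id = null;
  const classes = [];
  // Assumption: the sigil must be immediately followed by its name.
  const attr = /\s*([#.])([A-Za-z][\w-]*)/y;
  let pos = head[0].length;
  while (line.slice(pos).trim() !== '') {
    attr.lastIndex = pos;
    const m = attr.exec(line);
    if (m === null) throw new SyntaxError(`Unexpected input near column ${pos + 1}`);
    if (m[1] === '#') id = m[2];
    else classes.push(m[2]);
    pos = attr.lastIndex;
  }
  let html = `<${head[1]}`;
  if (id !== null) html += ` id="${id}"`;
  if (classes.length > 0) html += ` class="${classes.join(' ')}"`;
  return html + '>';
}

console.log(transpileTagLine('section #mainWrapper .container'));
// prints <section id="mainWrapper" class="container">
```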