Tags: parsing, compiler-construction, tokenize

Determining whether a token is the first on a line


I’m in the process of writing a lexer for a custom programming language. First off I want to say this is a personal exercise and I want to do it the hand-written way, and not use any generator tools such as Lex or Flex.

One of the syntaxes in my language is that there are three types of comments: single-line, multi-line, and doc:

  1. single-line comments:
    code(x, y, z); %% a comment that ends at the end of the line
    moreCode(x, y, z);
    
  2. multi-line comments:
    code(x, %- a comment that starts and ends on the same line. -% y, z);
    moreCode(x, %- a comment that starts, contains a line break,
      and then ends. -% y, z);
    
  3. doc comments:
    %%%
    a doc comment. the delimiters must be on their own line
    %%%
    code(x, y, z);
    

The Problem

This question is about tokenizing the doc type of comment (#3). Right now I can successfully tokenize single- and multi-line, and I can tokenize doc comments as if they were multi-line. But this leads to a problem:

Incorrect Behavior

code(); %%% This text
is commented
out. %%% notCommentedOut();

The “doc comment” is treated as a multi-line comment. From the above, my tokenizer incorrectly produces these tokens:

  1. code — identifier
  2. ( — symbol
  3. ) — symbol
  4. ; — symbol
  5. %%% This text is commented out. %%% — comment
  6. notCommentedOut — identifier
  7. ( — symbol
  8. ) — symbol
  9. ; — symbol

Expected Behavior

The above tokenization is incorrect because I want to force the %%% delimiters to be on their own line in order to be registered as a doc comment, and I want any %%% that is not on its own line to be treated as starting a single-line comment (because it begins with %%). That means that the correct tokenization should be:

code(); %%% This is commented out.
notCommentedOut();
'also' + !commentedOut; %%% also commented out
  1. code — identifier
  2. ( — symbol
  3. ) — symbol
  4. ; — symbol
  5. %%% This is commented out. — comment
  6. notCommentedOut — identifier
  7. ( — symbol
  8. ) — symbol
  9. ; — symbol
  10. 'also' — string
  11. + — symbol
  12. ! — symbol
  13. commentedOut — identifier
  14. ; — symbol
  15. %%% also commented out — comment

Similarities

Other languages have similar constructs, for example, in Markdown, headings and fenced code blocks:

# this is a heading

foobar # this is not a heading

```
this is a fenced code block
```

foobar ``` this is not
a fenced code block ```

In LaTeX, we can put block equations:

$$
f(x) = 2^{x + 1}
$$

My Approach

Code

(TypeScript, reduced for clarity.)

// advance the scanner `n` number of characters
function advance(n: number = 1): void {
    if (n === 1) {
        // reassign c0 to the next character
        // reassign c1 to lookahead(1)
        // reassign c2 to lookahead(2)
    } else {
        advance(n - 1)
        advance()
    }
}
while (!character.done) {
    if (whitespace.includes(c0)) {
        const wstoken = new Token(character.value)
        wstoken.type = TokenType.WHITESPACE
        advance()
        while (!character.done && whitespace.includes(c0)) {
            wstoken.cargo += c0
            advance()
        }
        // yield wstoken // only if we want the lexer to return whitespace
        continue; // move on to the next token; `break` here would end tokenization
    }

    const token = new Token(character.value)
    if (c0 === ENDMARK) {
        token.type = TokenType.EOF
        advance()
    } else if (c0 + c1 + c2 === comment_doc_start) { // we found a doc comment: `%%%`
        token.type = TokenType.COMMENT
        token.cargo += comment_doc_start
        advance(comment_doc_start.length)
        while (!character.done && c0 + c1 + c2 !== comment_doc_end) {
            if (c0 === ENDMARK) throw new Error("Found end of file before end of comment")
            token.cargo += c0
            advance()
        }
        // add comment_doc_end to token
        token.cargo += comment_doc_end
        advance(comment_doc_end.length)
    } else if (c0 + c1 === comment_multi_start) { // we found a multi-line comment: `%- -%`
        token.type = TokenType.COMMENT
        token.cargo += comment_multi_start
        advance(comment_multi_start.length)
        while (!character.done && c0 + c1 !== comment_multi_end) {
            if (c0 === ENDMARK) throw new Error("Found end of file before end of comment")
            token.cargo += c0
            advance()
        }
        // add comment_multi_end to token
        token.cargo += comment_multi_end
        advance(comment_multi_end.length)
    } else if (c0 + c1 === comment_line) { // we found a single-line comment: `%%`
        token.type = TokenType.COMMENT
        token.cargo += comment_line
        advance(comment_line.length)
        while (!character.done && c0 !== '\n') {
            if (c0 === ENDMARK) throw new Error("Found end of file before end of comment")
            token.cargo += c0
            advance()
        }
        // do not add '\n' to token
    } else {
        throw new Error(`I found a character or symbol that I do not recognize: ${c0}`)
    }
    yield token
}

Thought Process

I’m thinking there are two options, neither of which seems desirable.

One option is to keep a boolean flag outside the while loop indicating whether the previous token was whitespace containing \n. Then use that flag when the next token starts with %%%: if the flag is true, the comment should close at the next %%%; otherwise it should close at the next \n. I’m not sure I like this option because it involves updating the flag for each and every token of code. It also doesn’t account for the ending delimiter, which must likewise be on its own line.
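As a sketch of this first option (hypothetical, simplified lexer; not my actual code above), the flag only needs to be updated once per token, and the whitespace branch is the only place it can become true. Note the closing check is simplified: it only requires that the closing %%% begin a line, not that it be alone on it.

```typescript
// Sketch of option 1: a single `atLineStart` flag maintained by the main loop.
// Hypothetical, reduced lexer that only distinguishes comments from "code";
// identifiers and symbols are lumped together for brevity.

type Tok = { type: "code" | "comment"; text: string };

function lex(src: string): Tok[] {
    const tokens: Tok[] = []
    let i = 0
    // true when nothing but whitespace has appeared since the last '\n'
    let atLineStart = true

    while (i < src.length) {
        const c = src[i]
        if (/\s/.test(c)) {
            if (c === "\n") atLineStart = true
            i++
            continue
        }
        if (src.startsWith("%%%", i) && atLineStart) {
            // doc comment: closes only at a '%%%' that begins a new line
            const j = src.indexOf("\n%%%", i + 3)
            if (j < 0) throw new Error("Found end of file before end of comment")
            tokens.push({ type: "comment", text: src.slice(i, j + 4) })
            i = j + 4
        } else if (src.startsWith("%%", i)) {
            // line comment; a mid-line '%%%' lands here too, since it starts with '%%'
            let j = src.indexOf("\n", i)
            if (j < 0) j = src.length
            tokens.push({ type: "comment", text: src.slice(i, j) })
            i = j
        } else {
            // everything else: consume one run of non-space, non-comment characters
            let j = i
            while (j < src.length && !/\s/.test(src[j]) && !src.startsWith("%%", j)) j++
            tokens.push({ type: "code", text: src.slice(i, j) })
            i = j
        }
        atLineStart = false
    }
    return tokens
}
```

Because only the whitespace branch can set the flag and every other branch clears it, the cost per token is a single assignment, which is less intrusive than it first sounds.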

Another option is to backtrack. When the lexer reaches a token starting with %%%, check the previous token to see if it is whitespace and contains \n. If so, the %%% token is a doc comment, and should close at the next %%%. If not, it’s an inline comment, and should close at \n. I really do not like this option as it involves backtracking, which increases complexity and time.
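The check in this second option doesn’t need full backtracking if the lexer simply remembers the last token it emitted (my code above doesn’t keep such a `prev` variable yet, so the names here are hypothetical):

```typescript
// Sketch of option 2: decide how to close a '%%%' token by inspecting the
// previously emitted token. `prev === null` covers start-of-file, where a
// '%%%' is trivially on its own line.
type PrevTok = { type: string; cargo: string } | null

function isDocDelimiter(prev: PrevTok): boolean {
    // '%%%' opens a doc comment only if the previous token was whitespace
    // containing a newline, or there is no previous token at all
    return prev === null
        || (prev.type === "WHITESPACE" && prev.cargo.includes("\n"))
}
```

Kept this way, the “backtrack” is an O(1) lookup at one saved token, so the complexity concern mostly evaporates; the remaining cost is that the lexer must retain whitespace tokens (or at least a summary of them) instead of discarding them outright.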

Are these options even remotely correct? Are they doable? Recommended? Or is there another approach I should be taking?


Solution

  • Both of the options you're describing here seem reasonable. I think they're special cases of two other general techniques for building lexical analyzers.

    Your idea of using a boolean flag to store whether you're at the beginning of a line is a particular instance of the idea of scanner states. Many scanners are driven by some sort of finite-state machine, and in some cases the rules by which those finite-state machines operate change based on context. For example, a scanner might switch modes when reading a string literal to account for escape sequences, raw string literals, etc. using a slightly different set of rules than usual. Alternatively, a scanner might keep track of a stack of states to support nested comments. So in that sense, you're in good company for approaching things this way.
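To make the nested-comments case concrete (this is a hypothetical extension; the question's %- -% comments don't nest), the "stack of states" can degenerate into a single depth counter, since every level of the stack is the same state:

```typescript
// Sketch: a depth counter standing in for a stack of identical scanner
// states, used to skip a *nested* %- ... -% comment.

function skipNestedComment(src: string, start: number): number {
    // `start` points at the opening "%-"; returns the index just past
    // the matching "-%"
    let depth = 0
    let i = start
    while (i < src.length) {
        if (src.startsWith("%-", i)) {
            depth++       // push a state: one level deeper
            i += 2
        } else if (src.startsWith("-%", i)) {
            depth--       // pop a state
            i += 2
            if (depth === 0) return i
        } else {
            i++
        }
    }
    throw new Error("end of file inside comment")
}
```

When the states are not all identical (say, string mode vs. comment mode vs. doc mode), the counter becomes an actual stack or enum-valued mode variable, but the control flow is the same.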

    Your idea of using leading context puts you in the same general family of scanners that are context-aware. Some scanners keep track of information about what's been read so far in order to determine how to interpret new characters. For example, a C++ compiler might keep track of whether it's in the middle of a template when deciding whether to interpret >> as two closing angle brackets or as a single bitshift operator.

    As for which of the two is easier: I suspect that using a global flag to keep track of the scanning state might be a bit easier to implement. If you build the system with the idea that you might generalize it to handle other sorts of scanning states, this would probably be fairly elegant.

    Hope this helps!