Search code examples
parsingcompilationrusttokenizelexical-analysis

How does the Rust compiler tokenize '>' vs '>>' in generics?


I've written many simple tokenizers and recursive-descent parsers, so I'm familiar with the basic concepts of how they work. But I was surprised when I stumbled upon the following Rust code:

Option<Option<i32>>

We know that Rust has a >> shift operator, so I think a naive tokenizer would output a >> token here, which the parser would see as an error (because it expects two > tokens instead).

But clearly the Rust compiler understands the situation and handles it correctly. What's going on here?

  • Does the tokenizer keep some state, somehow knowing that it needs to close an angle bracket?
  • Does the parser check for >> and break it into two tokens that are pushed back into the token stream?
  • Or something else entirely?

Solution

  • You can look at Rust parsing libraries to see how they handle it.

    Library comparisons

    Fuzzy-Pickles

    This is a parser that I've written, so I'm most familiar with the concepts.

    The tokenizer is a simple byte-by-byte parser which greedily consumes the characters >> to create a DoubleRightAngle token.

    Once tokenization is done, all of these tokens are collected into a vector and a second parsing pass takes place. During this pass, the parsing location is a complicated index that allows being "split". This allows the parser to decompose a >> into two > if needed. The specific parsing functions look for a >> or two recursive > depending on what is being parsed.

    Both the tokenization and parsing are implemented using the peresil crate.

    Syn

    Syn is another parsing library. Here, they use a related idea: each token is composed of multiple spans, one for each character. That is, the Shr struct has a spans: [Span; 2] field.

    Rustc

    It appears that the compiler allows "gluing" multiple tokens into a bigger one. During parsing, >> can be "consumed" and replaced with a >:

    token::BinOp(token::Shr) => {
        let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
        Some(self.bump_with(token::Gt, span))
    }
    token::BinOpEq(token::Shr) => {
        let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
        Some(self.bump_with(token::Ge, span))
    }
    token::Ge => {
        let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
        Some(self.bump_with(token::Eq, span))
    }
    

    Additional points

    There's an additional wrinkle around whitespace. A parser should parse both of both of these types equivalently:

    Option<Option<i32>>
    Option < Option < i32 > >
    

    However, it should not parse these expressions equivalently:

    a >>= 1
    a >> = 1