parsing compilation rust tokenize lexical-analysis

How does the Rust compiler tokenize '>' vs '>>' in generics?

I've written many simple tokenizers and recursive-descent parsers, so I'm familiar with the basic concepts of how they work. But I was surprised when I stumbled upon the following Rust code:

Option<Option<i32>>

We know that Rust has a >> shift operator, so I think a naive tokenizer would output a >> token here, which the parser would see as an error (because it expects two > tokens instead).

But clearly the Rust compiler understands the situation and handles it correctly. What's going on here?

Does the tokenizer keep some state, somehow knowing that it needs to close an angle bracket?
Does the parser check for >> and break it into two tokens that are pushed back into the token stream?
Or something else entirely?

Solution

You can look at Rust parsing libraries to see how they handle it.

Library comparisons

Fuzzy-Pickles

This is a parser that I've written, so I'm most familiar with the concepts.

The tokenizer is a simple byte-by-byte parser which greedily consumes the characters >> to create a DoubleRightAngle token.

Once tokenization is done, all of these tokens are collected into a vector and a second parsing pass takes place. During this pass, the parsing location is a complicated index that allows being "split". This allows the parser to decompose a >> into two > if needed. The specific parsing functions look for a >> or two recursive > depending on what is being parsed.

Both the tokenization and parsing are implemented using the peresil crate.

Syn

Syn is another parsing library. Here, they use a related idea: each token is composed of multiple spans, one for each character. That is, the Shr struct has a spans: [Span; 2] field.

Rustc

It appears that the compiler allows "gluing" multiple tokens into a bigger one. During parsing, >> can be "consumed" and replaced with a >:

token::BinOp(token::Shr) => {
    let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
    Some(self.bump_with(token::Gt, span))
}
token::BinOpEq(token::Shr) => {
    let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
    Some(self.bump_with(token::Ge, span))
}
token::Ge => {
    let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
    Some(self.bump_with(token::Eq, span))
}

Additional points

There's an additional wrinkle around whitespace. A parser should parse both of both of these types equivalently:

Option<Option<i32>>
Option < Option < i32 > >

However, it should not parse these expressions equivalently:

a >>= 1
a >> = 1