I've written many simple tokenizers and recursive-descent parsers, so I'm familiar with the basic concepts of how they work. But I was surprised when I stumbled upon the following Rust code:
Option<Option<i32>>
We know that Rust has a >>
shift operator, so I think a naive tokenizer would output a >>
token here, which the parser would see as an error (because it expects two >
tokens instead).
But clearly the Rust compiler understands the situation and handles it correctly. What's going on here?
>>
and break it into two tokens that are pushed back into the token stream?You can look at Rust parsing libraries to see how they handle it.
This is a parser that I've written, so I'm most familiar with the concepts.
The tokenizer is a simple byte-by-byte parser which greedily consumes the characters >>
to create a DoubleRightAngle
token.
Once tokenization is done, all of these tokens are collected into a vector and a second parsing pass takes place. During this pass, the parsing location is a complicated index that allows being "split". This allows the parser to decompose a >>
into two >
if needed. The specific parsing functions look for a >>
or two recursive >
depending on what is being parsed.
Both the tokenization and parsing are implemented using the peresil crate.
Syn is another parsing library. Here, they use a related idea: each token is composed of multiple spans, one for each character. That is, the Shr
struct has a spans: [Span; 2]
field.
It appears that the compiler allows "gluing" multiple tokens into a bigger one. During parsing, >>
can be "consumed" and replaced with a >
:
token::BinOp(token::Shr) => {
let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
Some(self.bump_with(token::Gt, span))
}
token::BinOpEq(token::Shr) => {
let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
Some(self.bump_with(token::Ge, span))
}
token::Ge => {
let span = self.token.span.with_lo(self.token.span.lo() + BytePos(1));
Some(self.bump_with(token::Eq, span))
}
There's an additional wrinkle around whitespace. A parser should parse both of both of these types equivalently:
Option<Option<i32>>
Option < Option < i32 > >
However, it should not parse these expressions equivalently:
a >>= 1
a >> = 1