java compiler-construction flex-lexer lexer

Distinguishing between right shift (>>) and Java generics

I am writing a lexer for java in flex. The java spec says:

"The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (§4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >."

So how can I distinguish between right shift operator and something like in <List<List>>?

Solution

The original Java generics proposal (JSR-14) required modifying the Java grammar for parameterized types so that it would accept >> and >>> in contexts where multiple close angle brackets were possible. (I couldn't find a useful authoritative link for JSR-14 but Gilad Bracha's GJ specification is still available on his website; the grammar modifications are shown in section 2.3.)

These modifications were never formally incorporated in any Java standard as far as I know; eventually, JLS8 incorporated the change to the description of lexical analysis which you quote in your question. (See JDK-8021600, which also reproduces the convoluted grammar which was originally proposed.)

The grammar modifications proposed by Bracha et al will work, but you might find that they make incorporating other grammar changes more complicated. (I haven't really looked at this in any depth, so it might not actually be a problem for the current Java Language Specification. But it still might be an issue for future editions.)

While contextual lexical analysis does allow the simpler grammar actually used in the JLS, it certainly creates difficulties for lexical analysis. One possible approach is to abandon lexical analysis altogether by using a scannerless parser; this will certainly work but you won't be able to accomplish that within the Bison/Flex model. Also, you might find that some of the modifications needed to support scannerless parsing also require non-trivial changes to the published grammar.

Another possibility is to use lexical feedback from the parser, by incorporating mid-rule actions (MRAs) which turn a "type context" flag on and off when type contexts are entered and exited. (There is a complete list of type contexts in §4.11 which can be used to find the appropriate locations in the grammar.) If you try this, please be aware that the execution of MRAs is not fully synchronised with lexical analysis because the parser generally requires a lookahead token to decide whether or not to reduce the MRA. You often need to put the MRA one symbol earlier in the grammar than you might think, so that it actually takes effect by the time it is needed.

Another possibility might be to never recognise >> and >>> as tokens. Instead, the lexer could return two different > tokens, one used when the immediate next character is a >:

>/>     { return CONJUNCTIVE_GT; }
>       { return INDEPENDENT_GT; }
  /* These two don't need to be changed. */
>>=     { return SHIFT_ASSIGN; }
>>>=    { return LONG_SHIFT_ASSIGN; }

Then you can modify your grammar to recognise >> and >>> operators, while allowing either form of > as a close angle bracket:

shift_op     : CONJUNCTIVE_GT INDEPENDENT_GT
long_shift_op: CONJUNCTIVE_GT CONJUNCTIVE_GT INDEPENDENT_GT
close_angle  : CONJUNCTIVE_GT | INDEPENDENT_GT
gt_op        : INDENPENDENT_GT /* This unit production is not really necessary */

That should work (although I haven't tried it), but it doesn't play well with the Bison/Yacc operator precedence mechanism, because you cannot declare precedence for a non-terminal. So you'd need to use an expression grammar with explicit operator precedence rules, rather than an ambiguous grammar augmented with precedence declarations.