Search code examples
javacompiler-constructiontokenjavacc

Can JavaCC distinguish token by its context?


Basic requirement is use keyword as identifier, so I want to distinguish the token from it's context.(e.g.class is a keyword, but we allowed a variable named class).

In java, this is possible, but it's so hard, here is how I do it

TOKEN :
{
    <I_CAL:     "CAL">  : DO_CAL
    | <I_CALL:  "CALL">
    | <I_CMP:   "CMP">
    | <I_EXIT:  "EXIT">
    | <I_IN:    "IN">
    | <I_JMP:   "JMP">
    | <I_JPC:   "JPC">  : NEED_CMP_OP
    | <I_LD:    "LD">   : NEED_DATA_TYPE
    | <I_NOP:   "NOP">
    | <I_OUT:   "OUT">
    | <I_POP:   "POP">
    | <I_PUSH:  "PUSH">
    | <I_RET:   "RET">
    | <I_DATA:  "DATA"> : DO_DATA
    | <I_BLOCK:  ".BLOCK">
}

// T prefix for Token
TOKEN :
{
    <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
// We need below TOKEN in special context, other wise they are just IDENTIFIER
//    | <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT">
//    | <PSEUDO_DATA_TYPE: "CHAR" >
//    | <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD">
//    | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
    | <T_LABEL: <IDENTIFIER> ([" "])* <COLON>>
}

// Now we need a CMP OP
<NEED_CMP_OP> TOKEN:
{
    <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ"> : DEFAULT
}
// Now we need a DATA TYPE
<NEED_DATA_TYPE,DO_CAL> TOKEN:
{
    // EXTENSION Add char to data type
    <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT" | "CHAR"> {
        if(curLexState == DO_CAL){
            SwitchTo(NEED_CAL_OP);
        }else{
            SwitchTo(DEFAULT);
        }
    }
}
// We need a CAL OP
<NEED_CAL_OP> TOKEN:
{
    <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD"> : DEFAULT
}
// Aslo need to skip the empty
<NEED_DATA_TYPE,NEED_CAL_OP,NEED_CMP_OP,DO_CAL,DO_DATA> SKIP:
{
    " "
|   "\t"
|   "\r"
|   "\f"
}

Source is here, I can distinguish the token from context by curLexState.

It is works, but fussy to do, need to add a lot extra state, and maintain a lot states.Is there any easy way to achieve this ?


Solution

  • There are three ways to do this outlined in the JavaCC FAQ.

    • One is to use lexical states, as you have done. This method can be tricky, but it's the only way to deal with situations where the length of the longest match depends on context or where rules for skipping depend on context. For your problem, it is probably more complex than you need.
    • A second is to use one token kind, and to use semantic lookahead based on the token image to get the parser to treat some token's specially in some circumstances. See the FAQ for more.
    • The third (and usually easiest) approach is to make distinctions at the lexical level and then ignore the distinctions at the syntactic level. This is usually the best way to deal with keywords that can double as identifiers.

    Below I'll give three examples of the third approach.


    Using keywords as identifiers

    If all you want to do is to allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer put in the usual rules.

    TOKEN: { <CLASS: "class"> }
    TOKEN: { < VARNAME: ["a-"z","A"-Z"](["a-"z","A"-Z"])* > } // Or what you will
    

    In the parser write a production

    Token varName() { Token t ; } : {
    {
        (t = <CLASS> | t = <VARNAME>)
        {return t ;}
    }
    

    Then use varName() elsewhere in the parser.


    The original poster's assembler

    Turning to the assembler example in the original question, let's look at the JPC instruction as an example. The JPC (Jump conditional) instruction is followed by a comparison operator such as Z, B, etc and then an operand that can be a number of things including identifiers. E.g. we could have

    JPC Z fred
    

    But we could also have an identifier named JPC or Z, so

    JPC Z JPC
    

    and

    JPC Z Z
    

    are also a valid JPC instructions.

    In the lexical part we have

    TOKEN : // Opcodes
    {
        <I_CAL: "CAL"> 
    |   <I_JPC: "JPC"> 
    |   ... // other op codes
        <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
    |   <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
    }
    ... // Other lexical rules.
    
    TOKEN : // Be sure this rule comes after all keywords.
    {
        < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
    }
    

    In the parser we have

    Instruction Instruction():{
        Instruction inst = new Instruction();
        Token o = null,dataType = null,calType = null,cmpType = null;
        Operand a = null,b = null; }
    {
        ...
        o = <I_JPC> cmpType = <CMP_OP> a = Operand()
        ...
    }
    
    Operand Operand():{
        Token t ; ... }
    {
         t = <T_REGISTER> ...
    |    t = Identifier()  ...
        ...
    }
    
    Token Identifier : {
        Token t ; }
    {
        t = <IDENTIFIER> {return t ;}
    |   t = <I_CAL>      {return t ;}
    |   t = <I_JPC>      {return t ;}
    |   t = <CMP_OP>     {return t ;}
    | ... // All other keywords
    }
    

    I would suggest excluding register names from the list of other keywords that could be used as identifiers.

    If you do include <T_REGISTER> in that list, then there will be an ambiguity in operand because Operand looks like this

    Operand Operand():{
        Token t ; ... }
    {
         t = <T_REGISTER> ...
    |    t = Identifier()  ...
        ...
    }
    

    Now there is an ambiguity because

    JPC Z R0
    

    has two parses. In the context of being an operand, we want tokens like "R0" to be parsed as registers not identifiers. Luckly JavaCC will prefer earlier choices, so this is exactly what will happen. You will get a warning from JavaCC. You can ignore the warning. (I add a comment to my source code so that other programmers don't worry.) Or you can suppress the warning with a lookahead specification.

    Operand Operand():{
        Token t ; ... }
    {
         LOOKAHEAD(1) t = <T_REGISTER> ...
    |    t = Identifier()  ...
        ...
    }
    

    Using right context

    So far all the examples have used left context. I.e. we can tell how to treat a token based solely on the sequence of tokens to its left. Let's look at a case where the interpretation of a keyword is based on the tokens to the right.

    Consider this simple imperative language in which all the keywords can be used as variable names.

    P -> Block <EOF>
    Block -> [S Block]
    S -> Assignment | IfElse
    Assignment -> LHS ":=" Exp
    LHS -> VarName
    IfElse -> "if" Exp Block ["else" Block] "end"
    Exp -> VarName
    VarName -> <ID> | if | else | end
    

    This grammar is unambiguous. You can make the grammar more complicated by adding new kinds of statements, expressions and left-hand sides; as long as the grammar stays unambiguous, such complications probably won't make much difference to what I'm going to say next. Feel free to experiment.

    The grammar is not LL(1). There are two places where a choice must be made based on more than one future token. One is the choice between Assignment and IfElse when the next token is "if". Consider the block

    a := b
    if := a
    

    vs

    a := b
    if q
        b := c
    end
    

    We can look ahead for a ":=" like this

    void S() : {} {
        LOOKAHEAD( LHS() ":=" ) Assignment()
    |
        IfElse() 
    }
    

    The other place we need to look ahead is when an "else" or an "end" is encountered at the start of a Block. Consider

    if x
        end := y
        else := z
    end
    

    We can solve this with

    void Block() : {} {
        LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
    |
        {}
    }