
JavaCC Parsing Limitation


I am trying to parse a text file with JavaCC. The file consists of multiple sentences, one per line. Each line may contain any sequence of "a" and "b" but must end with "a" followed by "b" immediately before the newline. JavaCC does not parse this as intended: it consumes the terminating a and b as part of the optional series, so nothing is left to match the required ending.

This should be parsed successfully by JavaCC:

aa ab aab
aab

The jjt file is as follows:

options {
    STATIC = false;
    FORCE_LA_CHECK = true;
    LOOKAHEAD = 20000;
    DEBUG_PARSER = true;
    DEBUG_LOOKAHEAD = true;
    OTHER_AMBIGUITY_CHECK = 3;
}

PARSER_BEGIN(Test)
class Test {
public static void main( String[] args )
throws ParseException {
    Test act = new Test (System.in);
    SimpleNode root = act.Start() ; 
    root.dump (" ");
    //System.out.println("Total = "+val);
}
}
PARSER_END(Test)

TOKEN_MGR_DECLS :
{
  int stringSize;
}  

SKIP : { < WS : " " >   }
SKIP : {"\t" | "\r" | "\uFFFF" | "\u201a" | "\u00c4" | "\u00ee" | "\u00fa" | "\u00f9" | "\u00ec" | "\u2013" }

TOKEN [IGNORE_CASE] :
{
    < A : "a" >
|   < B : "b" >
|   < NEWLINE : (("\n")+ ) >
}   


SimpleNode Start() throws NumberFormatException :
{
    int i ;
    int value=0 ;
} {
chapter()
{ 
    return jjtThis; }   
}

void chapter() :
{ } {
    (LOOKAHEAD (part_sentence()) part_sentence())+ (newline())? <EOF>
}
void part_sentence() :
{ } {
    <NEWLINE> ( a() | b())+ a() b() 
}
void a() :
{ } {
    <A>
}
void b() :
{ } {
    <B>
}
void newline() throws NumberFormatException :
{ }{
    <NEWLINE> 
    {   System.out.print ("N# ");   }
}

To clarify: the non-terminals a() and b() cannot be replaced with tokens; they are reduced to "a" and "b" here only for simplicity. Also, NEWLINE cannot be moved to the end of the non-terminal part_sentence because of other constraints.

I have been stuck on this problem for the past 4 days. My last hope was semantic lookahead, LOOKAHEAD({ !(getToken(1).kind==a() && getToken(2).kind==b() && getToken(3).kind==newline()) }), but I cannot get a handle on non-terminals that way. Any help would be deeply appreciated.


Solution

  • [Note: you say any sequence of a's and b's that ends with "ab", but your code uses + rather than *, which requires at least one extra a or b before the final "ab". I'm going to assume you really did mean any sequence that ends with "ab", including the bare sequence "ab". End Note.]

    You need to exit the loop on the basis of lookahead. What you want to do is this:

    ( LOOKAHEAD( x ) 
      (a() | b() )
    )*
    a() b() <NEWLINE>
    

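    where x says that the next items of input do not match a() b() <NEWLINE>. Unfortunately, there is no way to say "do not match" using syntactic lookahead. The trick is to replace the loop with a recursion, so that the positive lookahead for a() b() <NEWLINE> is tried first and the loop body becomes the fallback.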

    void oneLine() : {} {
        LOOKAHEAD( a() b() <NEWLINE> )
        a() b() <NEWLINE>
    |
        a() oneLine()
    |
        b() oneLine()
    }
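
To see why the recursion succeeds where the loop does not, here is a minimal hand-written sketch in plain Java. It is not JavaCC output: the class `OneLineSketch`, its `Kind` enum, and its helper methods are hypothetical names invented for illustration, and a() and b() are reduced to single tokens purely to keep the sketch short. `greedyLine` mimics a loop that decides on one token of lookahead, and `oneLine` mimics the recursive production above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written recognizer, for illustration only.
final class OneLineSketch {
    enum Kind { A, B, NEWLINE }

    private final List<Kind> tokens;
    private int pos = 0;

    OneLineSketch(List<Kind> tokens) { this.tokens = tokens; }

    // Tokenizes text; spaces are skipped and runs of '\n' form one NEWLINE token.
    static List<Kind> lex(String text) {
        List<Kind> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            switch (Character.toLowerCase(c)) {
                case 'a': out.add(Kind.A); break;
                case 'b': out.add(Kind.B); break;
                case '\n':
                    if (out.isEmpty() || out.get(out.size() - 1) != Kind.NEWLINE) {
                        out.add(Kind.NEWLINE);
                    }
                    break;
                case ' ': break;
                default: throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    // True if the token at pos + offset exists and has the given kind.
    private boolean peek(int offset, Kind kind) {
        int i = pos + offset;
        return i < tokens.size() && tokens.get(i) == kind;
    }

    // Consumes one token of the given kind, or fails without consuming.
    private boolean expect(Kind kind) {
        if (peek(0, kind)) { pos++; return true; }
        return false;
    }

    // Loop form: keeps going while the next token is A or B, so it also
    // swallows the final "a b" and nothing is left for the required ending.
    boolean greedyLine() {
        int count = 0;
        while (peek(0, Kind.A) || peek(0, Kind.B)) { pos++; count++; }
        return count > 0 && expect(Kind.A) && expect(Kind.B) && expect(Kind.NEWLINE);
    }

    // Recursive form: first try the ending "a b NEWLINE" by looking ahead,
    // otherwise consume one a or b and recurse.
    boolean oneLine() {
        if (peek(0, Kind.A) && peek(1, Kind.B) && peek(2, Kind.NEWLINE)) {
            pos += 3;
            return true;
        }
        if (peek(0, Kind.A) || peek(0, Kind.B)) {
            pos++;
            return oneLine();
        }
        return false;
    }

    static boolean matchesGreedy(String text) {
        OneLineSketch p = new OneLineSketch(lex(text));
        return p.greedyLine() && p.pos == p.tokens.size();
    }

    static boolean matchesRecursive(String text) {
        OneLineSketch p = new OneLineSketch(lex(text));
        return p.oneLine() && p.pos == p.tokens.size();
    }
}
```

For example, `OneLineSketch.matchesRecursive("aa ab aab\n")` accepts the line, while `OneLineSketch.matchesGreedy` rejects the same input because the loop has already consumed the closing a and b.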
    

    You say that you want the <NEWLINE> at the start of the production. For reasons explained in the FAQ, I don't like using syntactic look ahead that extends beyond the choice at hand. But the following could be done.

    void oneLine() : {} { <NEWLINE> oneLinePrime() }
    
    void oneLinePrime() : {} {
        LOOKAHEAD( a() b() <NEWLINE> )
        a() b()
    |
        a() oneLinePrime()
    |
        b() oneLinePrime()
    }
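
The leading-NEWLINE variant can be sketched the same way. Again this is plain hand-written Java rather than generated code, and `LeadingNewlineSketch` with its methods is a hypothetical name; the `chapter` method assumes a top-level rule of the shape ( oneLine() )+ ( <NEWLINE> )? <EOF>, modeled on the question's grammar. The point it illustrates is that the lookahead inspects the NEWLINE without consuming it, leaving it for the start of the next line.

```java
// Hypothetical hand-written recognizer, for illustration only.
final class LeadingNewlineSketch {
    private final String tokens; // each char is one token: 'a', 'b', or '\n'
    private int pos = 0;

    LeadingNewlineSketch(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase().toCharArray()) {
            if (c == ' ') continue; // skipped, as in the grammar's SKIP rule
            if (c != 'a' && c != 'b' && c != '\n') {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
            // Runs of '\n' collapse into a single NEWLINE token.
            if (c == '\n' && sb.length() > 0 && sb.charAt(sb.length() - 1) == '\n') continue;
            sb.append(c);
        }
        tokens = sb.toString();
    }

    // True if the token at pos + offset exists and equals kind.
    private boolean peek(int offset, char kind) {
        int i = pos + offset;
        return i < tokens.length() && tokens.charAt(i) == kind;
    }

    // oneLine: <NEWLINE> oneLinePrime(); restores pos if the line does not match.
    private boolean oneLine() {
        int start = pos;
        if (peek(0, '\n')) {
            pos++;
            if (oneLinePrime()) return true;
        }
        pos = start;
        return false;
    }

    // oneLinePrime: the lookahead checks "a b NEWLINE" but only "a b" is consumed.
    private boolean oneLinePrime() {
        if (peek(0, 'a') && peek(1, 'b') && peek(2, '\n')) {
            pos += 2;
            return true;
        }
        if (peek(0, 'a') || peek(0, 'b')) {
            pos++;
            return oneLinePrime();
        }
        return false;
    }

    // Assumed top-level rule: one or more lines, an optional NEWLINE, then end of input.
    boolean chapter() {
        if (!oneLine()) return false;
        while (oneLine()) { /* keep consuming lines */ }
        if (peek(0, '\n')) pos++;
        return pos == tokens.length();
    }

    static boolean accepts(String text) {
        return new LeadingNewlineSketch(text).chapter();
    }
}
```

One consequence worth checking against your real input: because the lookahead requires a NEWLINE after the final b, a last line that is followed directly by end of file is not recognized in this sketch, so `accepts("\naab")` fails while `accepts("\naab\n")` succeeds.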