
ANTLR: the input matches the grammar, but the parser does not recognize it


I'm writing a parser for SML messages. Input: a file with many SML messages. Output: a queue of messages with their elements identified. This is my code:

grammar SML;
options {language = Java;}
@header {
  package SECSParser;
 import SECSParser.SMLLexer;
}

@lexer::header {
  package SECSParser;
}

@parser::members {
  public static void main(String[] args) throws Exception {
    String file = "C:\\Messages.sml";
    SMLLexer lexer = new SMLLexer(new ANTLRFileStream(file));
    SMLParser parser = new SMLParser(new CommonTokenStream(lexer));
    parser.program();
  }
}

@lexer::members {
  public static String place = "end";
  public static void setPlace(String text) { SMLLexer.place = text; }
  public static String getPlace() {return SMLLexer.place;}
  public static boolean placeIsType() {
    return (SMLLexer.place.equals("wb")
    | SMLLexer.place.equals("value")
    | SMLLexer.place.equals("type"));
  }
  public static boolean placeIsStreamFunction() {
    return (SMLLexer.place.equals("sf") | SMLLexer.place.equals("name"));
  }
  public static boolean placeIsWaitBit() {
    return (SMLLexer.place.equals("sf") | SMLLexer.place.equals("wb"));
  }
  public boolean ahead() {
    if ((input.LA(-2) == 'S') || (input.LA(-2) == 's')) {
      return false;
    }
    return true;
  }
}

program:(message)* EOF;
message:{System.out.println("MESSAGE     : \n");}
  {SMLLexer.setPlace("name");}
  name ws* ':' ws* {SMLLexer.setPlace("sf");} str_func 
  (ws+ {SMLLexer.setPlace("wb");} waitbit)? (ws+ item)? '.' 
   ws* {SMLLexer.setPlace("end");};

name:LETTER(LETTER| NUMBER| '_')* {System.out.println("NAME     : " + $text + "\n");};
fragment STR:~('\'' | '\"');
NUMBER:'0'..'9';
LETTER:(('A'..'Z') | ('a'..'z'));
str_func: (('S' | 's') stream ('F' | 'f') function);
stream: NUMBER+ {System.out.println("STREAM     : " + $text + "\n");};
function: NUMBER+ {System.out.println("FUNCTION     : " + $text + "\n");};
waitbit: {SMLLexer.placeIsWaitBit()}?=>('W' | 'w') {
  System.out.println("WAITBIT     : " + $text + "\n");
};
item:{System.out.println("ITEM     : \n");} ws* SITEM ws* {SMLLexer.setPlace("type");}
  TYPE ( (ws* '[' number_item ']')? ws+ {SMLLexer.setPlace("value");}value)? 
  ws* EITEM ws* COMMENT? ws*;
SITEM: '<' {SMLLexer.setPlace("type");};
EITEM: '>';
TYPE:{SMLLexer.placeIsType()}?=>( 'A' | 'a' | 'L'| 'l'| 'BINARY'| 'binary'| 'BOOLEAN'| 'boolean'| 'JIS'| 'jis'| 'I8'| 'i8' | 'I1'| 'i1'| 'I2'| 'i2' | 'I4'| 'i4'| 'F4'| 'f4'| 'F8'| 'f8'| 'U8'| 'u8' | 'U1'| 'u1'| 'U2' | 'u2'| 'U4'| 'u4' ){System.out.println("TYPE     : " + $text + "\n");};
number_item: NUMBER+ {System.out.println("NUMBER ITEM     : " + $text + "\n");};
value:(item ws*)+| (string ws*)+| ((LETTER| NUMBER)ws*)+;
COMMENT:('/*' (options {greedy=false;}: .)* '*/') {$channel = HIDDEN;};
string:('\'' STR? '\'')| ('\"' STR? '\"') {System.out.println("VALUE     : " + $text + "\n");};
ANY:.;
ws:(' '| '\t'| '\r'| '\n'| '\f');

This is my file "Messages.sml":

Are_You_There1l : S1F4 W.
On_Line_Data:S1F4 W
<L[2]
    <U4 13>
    <U4 7>
>.
W1Are_You_There: S1F4 W.

And the result is:

MESSAGE     : 
NAME     : Are_You_There1l
STREAM     : 1
FUNCTION     : 4
WAITBIT     : W
MESSAGE     : 
NAME     : On_Line_Data
STREAM     : 1
FUNCTION     : 4
WAITBIT     : W
ITEM     : 
MESSAGE     : 
NAME     : L
TYPE     : U4
TYPE     : U4
MESSAGE     : 
NAME     : Are_You_There
STREAM     : 1
FUNCTION     : 4
WAITBIT     : W

**C:\Messages.sml line 4:1 mismatched input 'L' expecting TYPE
C:\Messages.sml line 4:2 mismatched input '[' expecting ':'**

I don't understand why my program doesn't recognize 'L' as a TYPE. When I try the same thing with the TYPE 'U4', it works.


Solution

  • There are too many things going wrong to give a direct answer to your question. Even if your question were answered, it wouldn't help much, because the grammar contains too many errors. I recommend throwing this away and starting over. But before starting over, read a couple of ANTLR tutorials or get hold of a copy of The Definitive ANTLR Reference.

    Some of the issues:

    • you don't seem to know the difference between parser and lexer rules. Some of your parser rules should be lexer rules and some of your lexer rules should really be parser rules;
    • you use fragment rules inside parser rules: this will never work, since fragment rules never turn into tokens themselves. Fragment rules can only be used in lexer rules (or in other fragment rules);
    • you're setting (static) lexer variables from the parser: you cannot do this! The parser buffers tokens as it sees fit, so by the time a parser action runs, the lexer may already have tokenized well past that point, and your logic goes horribly wrong. There is a strict separation between the lexer and the parser: the lexer simply produces tokens without any interference from the parser! Lexing is a separate process. If you do want the parser to influence tokenization, choose something other than ANTLR (Google for "scannerless parsing", "PEG" and/or "packrat"). This problem is most probably why 'L' isn't being tokenized as a TYPE in your particular case;
    • you're using literal tokens, like ('W' | 'w'), but also LETTER as a lexer rule. However, a single 'w' or 'W' will now never be tokenized as a LETTER, because defining literal tokens inside a parser rule is more or less the same as doing:

      W : 'w' | 'W';
      LETTER : 'a'..'z' | 'A'..'Z'; // this will never match a 'w' or 'W' now!

      This also has to do with the fact that ANTLR's lexer operates independently from the parser.
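
The last two points can be illustrated with a toy tokenizer in plain Java. This is only a sketch under simplifying assumptions, not ANTLR's generated lexer: the class `TokenPriorityDemo`, its rule table, and the single-character matching are all hypothetical. It shows that the whole input is tokenized without any knowledge of what a parser expects, and that a rule listed earlier (here `W`, standing in for the literal token) shadows a later rule (`LETTER`) for the same character.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

public class TokenPriorityDemo {

    // Rules in priority order: the first rule that matches a character wins.
    // "W" stands in for the implicit token created by the literal ('W' | 'w').
    private static final Map<String, IntPredicate> RULES = new LinkedHashMap<>();
    static {
        RULES.put("W", c -> c == 'w' || c == 'W');
        RULES.put("LETTER", c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
        RULES.put("NUMBER", c -> c >= '0' && c <= '9');
    }

    // Tokenizes the whole input up front; no parser state is consulted.
    public static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            String type = "ANY"; // fallback, like the catch-all ANY rule
            for (Map.Entry<String, IntPredicate> rule : RULES.entrySet()) {
                if (rule.getValue().test(c)) {
                    type = rule.getKey();
                    break; // earlier rules shadow later ones
                }
            }
            types.add(type);
        }
        return types;
    }

    public static void main(String[] args) {
        // 'W' is reported as W, never as LETTER, whatever the parser wanted.
        System.out.println(tokenize("aW1")); // prints [LETTER, W, NUMBER]
    }
}
```

In this sketch, moving the `W` entry after `LETTER` would make every 'w' or 'W' come out as `LETTER` instead. Either way the choice is made by rule order alone, which is why flags set from parser actions cannot reliably steer it.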

    Again: you really need to master the basics before continuing IMO.

    Best of luck!