Tags: java, xml-parsing, antlr, antlrworks

Combined grammar works but errors when lexer and parser grammar separated?


Original

This is my first time asking a question on Stack Overflow, so I hope the solution isn't too blazingly obvious. I am trying to use ANTLR to parse data from an XML file and generate usable tokens for a Java program I am creating in Eclipse. My only experience with ANTLR is using the ANTLRWorks IDE to generate the Java code I need to incorporate. My XML file is very large and complex, so to start with I am only interested in looking at a few attributes at a time. To make things simpler, I am attempting to use the filter option to sift through the input and grab only the data that matches my token definitions. I realize that the filter option can only be used if the parser and lexer grammars are defined separately, but when I tried to adapt my combined grammar, I suddenly started getting error after error complaining about missing or unwanted tokens. I have been pulling my hair out trying to understand why one works and not the other. I have both grammars saved in the same file, and removing the options statement does nothing to fix the issue.

Here is my combined grammar, followed by my adapted grammar. If anyone can offer any help or direction, I would be so grateful.

Combined:

grammar dataExtract;

prog    :    .*;

SOF     :      ('<posts>');

Tag_string :    ('<')(.~'>')+('>');

Tag :   ('Tags="')Tag_string+('"');

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

EOF :   '</posts>';

Separate:

parser grammar dataExtract;

prog    :    .*;

lexer grammar dataExtract


SOF     :      ('<posts>');


options{filter=true};

Tag_string :    ('<')(.~'>')+('>');

Tag :   ('Tags="')Tag_string+('"');

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

EOF :   '</posts>';

Updated

Thank you for your answers. It makes sense to me now and I am closer to getting my grammar to work; I just have one remaining issue, it seems. The parser grammar seems to work just fine, and ANTLRWorks even generates a Java class without complaining. However, the lexer rules still seem to break when I save the lexer definition in its own .g file: even if the only rule I define is All : .*;, I get an EarlyExitException. Also, if I understand correctly, the tokenVocab option looks for the tokens file generated from the lexer grammar, but since I am getting an error and no code is being generated, no tokens file has been created yet, so I would assume that the parser should not be generated correctly without it. Any idea what is happening? I have tried searching for similar issues, but a lot of the material seems to assert that this error is caused when no tokens in the input match the rules. Since I haven't even gotten to the point of giving it input, this can't be the case.


Solution

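  • When the lexer and parser grammars are separated, ANTLR does not append "Lexer" or "Parser" to the name of the generated .java source file; each class is named exactly after its grammar. Two grammars that share a name would therefore generate the same file, so you should use unique names in this case: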

    parser

    parser grammar DataExtractParser;
    
    options {
      tokenVocab=DataExtractLexer; 
    }
    
    ...
    

    lexer

    lexer grammar DataExtractLexer;
    
    ...
    

    Also, as already mentioned in another answer, explicitly tell the parser which tokens (lexer rules) it should use through the tokenVocab=LEXER_GRAMMAR_NAME; option.
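
Putting the points above together, a minimal sketch of the two grammars might look like the following. It assumes ANTLR 3 (which the $channel=HIDDEN action in the question suggests) and one grammar per .g file, with each file named after its grammar. The token name TAGS and its rule body are a simplification made up for illustration, not the asker's original rules. Note that the options block sits directly after the grammar header, and that no lexer rule is named EOF, since ANTLR already predefines that name.

```antlr
// File: DataExtractLexer.g
lexer grammar DataExtractLexer;

options {
  filter=true;  // skip any input that none of the lexer rules match
}

// Hypothetical rule: one whole Tags="..." attribute, up to the closing quote.
TAGS : 'Tags="' ( ~'"' )* '"' ;
```

```antlr
// File: DataExtractParser.g
parser grammar DataExtractParser;

options {
  tokenVocab=DataExtractLexer;  // token types come from DataExtractLexer.tokens
}

// Zero or more TAGS tokens, then the built-in end-of-input token.
prog : TAGS* EOF ;
```

With this layout the lexer grammar should be generated first, so that DataExtractLexer.tokens already exists when the parser grammar is processed. Because filter=true discards unmatched input, a separate whitespace rule should not be needed in this sketch.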