ANTLR Exclude keywords while parsing a string


I'm trying to write a grammar for a rather simple language using ANTLR4. It's supposed to process some theater-related text. There are just 3 rules.

1 - Any text that starts with a tab (\t) should simply be printed out.

    It was a rather warm
    Summer day.

2 - In case the text doesn't start with a tab, it'll most likely contain a character name. For example:

Captain Go forth, my minions!

It would be perfect to grab the character name and the text they're saying separately.

3 - There are also commands, which start with a tab as well, followed by a keyword and some arguments, like this:

    lights ON
    curtain OPEN

This is my grammar:

grammar Theater;

module: statement+ EOF;
statement: function | print | print_with_name;

function: '\t' command NL;
command: lights | curtain;

lights: 'lights' WS ('ON' | 'OFF');
curtain: 'curtain' WS ('OPEN' | 'CLOSE');

print: PRINT;
PRINT: '\t' .*? NL NL;

print_with_name: PRINT_WITH_NAME;
PRINT_WITH_NAME: ~[ \t\r\n] .*? NL NL;

NL: '\r\n' | '\r' | '\n';
WS: [ \t]+?;

I run this on the following test file:

    It was a rather warm
    Summer day.
Captain Go forth, my minions!
    lights ON
    curtain OPEN

And these are tokens I get:

[@0,0:22='\tIt was a rather warm\r\n',<PRINT>,1:0]
[@1,23:36='\tSummer day.\r\n',<PRINT>,2:0]
[@2,37:67='Captain Go forth, my minions!\r\n',<PRINT_WITH_NAME>,3:0]
[@3,68:79='\tlights ON\r\n',<PRINT>,4:0]
[@4,80:94='\tcurtain OPEN\r\n',<PRINT>,5:0]
[@5,95:94='<EOF>',<EOF>,6:0]

print and print_with_name both work as expected. Commands, on the other hand, are being treated as print. I guess this is because PRINT and PRINT_WITH_NAME are lexer rules, while the commands are parser rules. Is there any way to make this work without converting all commands to lexer rules? I tried hard to write something like "treat all text as print, except when it starts with one of the keywords", but couldn't find anything that works. I'm only starting out with ANTLR, so I must be missing something.

I don't expect you to write the grammar for me. Just mentioning a feature I should use would be perfect.


Solution

  • Lexer modes can help here: they are a way to nudge the lexer in the right direction by making it a bit context-sensitive.

    To use lexer modes, you must split the lexer and parser grammars into separate files. Here is TheaterLexer.g4:

    lexer grammar TheaterLexer;
    
    Name      : ~[ \t\r\n]+ -> mode(DialogMode);
    K_Lights  : '\tlights'  -> mode(CommandMode);
    K_Curtain : '\tcurtain' -> mode(CommandMode);
    Tab       : '\t'        -> skip, mode(TabMode);
    
    mode DialogMode;
     DialogText : ~[\r\n]+;
     DialogNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
    
    mode CommandMode;
     CommandText : ~[\r\n]+;
     CommandNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
    
    mode TabMode;
     LiteralText : ~[\r\n]+;
     LiteralNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
    

    And the parser part (put it in TheaterParser.g4):

    parser grammar TheaterParser;
    
    options { tokenVocab=TheaterLexer; }
    
    parse
     : file EOF
     ;
    
    file
     : atom*
     ;
    
    atom
     : literal
     | dialog
     | command
     ;
    
    literal
     : LiteralText+
     ;
    
    dialog
     : Name DialogText+
     ;
    
    command
     : K_Lights CommandText+
     | K_Curtain CommandText+
     ;
    

    If you now generate the lexer and parser classes and run the following Java code:

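    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTree;
    
    // Same kinds of lines as in the question: literal text, dialog, commands
    String source =
            "\tIt was a rather warm\n" +
            "\tSummer day.\n" +
            "Captain Go forth, my minions!\n" +
            "\tlights ON\n" +
            "\tcurtain OPEN";
    
    TheaterLexer lexer = new TheaterLexer(CharStreams.fromString(source));
    TheaterParser parser = new TheaterParser(new CommonTokenStream(lexer));
    ParseTree root = parser.parse();
    
    // Print the parse tree in LISP-like notation
    System.out.println(root.toStringTree(parser));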
    

    the following will be printed to your console:

    (parse 
      (file 
        (atom 
          (literal It was a rather warm Summer day.)) 
        (atom 
          (dialog Captain  Go forth, my minions!)) 
        (atom 
          (command \tlights  ON)) 
        (atom 
          (command \tcurtain  OPEN))) <EOF>)
    

    (the indentation is added for readability)

    Note that you can use just a single mode, but I assumed you'd want to treat the tokens differently in the different modes. If this is not the case, you could just do:

    lexer grammar TheaterLexer;
    
    Name      : ~[ \t\r\n]+ -> mode(Step2Mode);
    K_Lights  : '\tlights'  -> mode(Step2Mode);
    K_Curtain : '\tcurtain' -> mode(Step2Mode);
    Tab       : '\t'        -> skip, mode(Step2Mode);
    
    mode Step2Mode;
     Text : ~[\r\n]+;
     NewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
    

    and change the parser rules accordingly.
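
For intuition about what the modes do, here is a rough hand-written imitation of the three-mode lexer in plain Java, with no ANTLR dependency. It is only a sketch under assumptions: the class name `TheaterModeSketch`, the `tokenize` and `runEnd` methods, and the `Type:text` rendering of tokens are all made up for this example, and ANTLR's generated lexer works differently internally. The point is just that the start of a line picks a mode, and the mode decides how the rest of the line is tokenized.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled imitation of lexer modes (illustration only, not ANTLR output).
class TheaterModeSketch {

    // DEFAULT classifies the start of a line; the other modes read the rest of it.
    enum Mode { DEFAULT, DIALOG, COMMAND, TAB }

    // Returns tokens rendered as "Type:text" (a format invented for this sketch).
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int pos = 0;
        while (pos < source.length()) {
            if (mode != Mode.DEFAULT) {
                char c = source.charAt(pos);
                if (c == '\r' || c == '\n') {
                    // Line break(s): emit nothing and switch back to DEFAULT.
                    pos = runEnd(source, pos, "\r\n", true);
                    mode = Mode.DEFAULT;
                } else {
                    // Everything up to the end of the line is one text token.
                    int end = runEnd(source, pos, "\r\n", false);
                    tokens.add(textType(mode) + ":" + source.substring(pos, end));
                    pos = end;
                }
            } else if (source.startsWith("\tlights", pos)) {
                // Keywords are checked before the bare tab (longest match first).
                tokens.add("K_Lights:\tlights");
                pos += "\tlights".length();
                mode = Mode.COMMAND;
            } else if (source.startsWith("\tcurtain", pos)) {
                tokens.add("K_Curtain:\tcurtain");
                pos += "\tcurtain".length();
                mode = Mode.COMMAND;
            } else if (source.charAt(pos) == '\t') {
                // A bare tab is dropped; the rest of the line is literal text.
                pos++;
                mode = Mode.TAB;
            } else {
                // A name runs up to the next space, tab, or line break.
                int end = runEnd(source, pos, " \t\r\n", false);
                if (end == pos) {
                    pos++; // stray space or line break: skipped here for simplicity
                } else {
                    tokens.add("Name:" + source.substring(pos, end));
                    pos = end;
                    mode = Mode.DIALOG;
                }
            }
        }
        return tokens;
    }

    // Token type produced for the remainder of a line in each non-default mode.
    static String textType(Mode mode) {
        switch (mode) {
            case DIALOG:  return "DialogText";
            case COMMAND: return "CommandText";
            default:      return "LiteralText";
        }
    }

    // Advances from 'from' while membership of the current char in 'set' equals 'inSet'.
    static int runEnd(String s, int from, String set, boolean inSet) {
        int i = from;
        while (i < s.length() && (set.indexOf(s.charAt(i)) >= 0) == inSet) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String source =
                "\tIt was a rather warm\n" +
                "\tSummer day.\n" +
                "Captain Go forth, my minions!\n" +
                "\tlights ON\n" +
                "\tcurtain OPEN";
        for (String token : tokenize(source)) {
            System.out.println(token);
        }
    }
}
```

As in the parse tree above, the text after a name or keyword keeps its leading space (for example `DialogText: Go forth, my minions!`), because the text token simply takes everything up to the line break. In the single-mode variant the three text types would collapse into one, and only the first token of each line would tell the line kinds apart.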