
How to include Python comments in Java ANTLR4 visitTerminal overridden function?


Context

I auto-generated the following files in accordance with this answer:

  • Python3Lexer.java
  • Python3ParserBase.java
  • Python3ParserListener.java
  • PythonDocstringModifierListener.java
  • Python3Parser.java

Then I modified the MWE in that question to include:

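import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.tree.TerminalNode;

public class SomePythonListener extends Python3ParserBaseListener {
  private final Python3Parser parser;
  private final String someValue;

  public SomePythonListener(Python3Parser parser, String someValue) {
    this.parser = parser;
    this.someValue = someValue;
  }

  // Called once for every token that a parser rule matched.
  @Override
  public void visitTerminal(TerminalNode node) {
    Token token = node.getSymbol();
    System.out.println("token.getType()=" + token.getType());
    System.out.println("getText:" + token.getText() + "END\n\n");
  }
}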

And I feed it the source code:

"""A file docstring.
With a multiline starting docstring.
That spans the first 3 lines."""
# Some Comment.

# Another comment
"""Some string."""
def foo():
    """Some docstring."""
    print('hello world')
    def bar():
        """Another docstring."""
        print('hello world')
def baz():
        """Third docstring."""
        print('hello universe')

This then outputs:

token.getType()=3
getText:"""A file docstring.
With a multiline starting docstring.
That spans the first 3 lines."""END

token.getType()=44
getText:
END

token.getType()=3
getText:"""Some string."""END

token.getType()=44
getText:
END

token.getType()=15
getText:defEND

token.getType()=45
getText:fooEND

token.getType()=57
getText:(END

token.getType()=58
getText:)END

token.getType()=60
getText::END

token.getType()=44
getText: END

token.getType()=1
getText:    END

token.getType()=3

For completeness, token type 44 is the newline token. The first docstring is printed, followed by a newline, followed by the second string """Some string.""". However, neither comment (# Some Comment. and # Another comment) is visited or printed.

Issue

The TerminalNode node objects of the visitTerminal do not include the comments.

Question

How can I include the comments in the visitor?

Attempt

Based on these answers, it seems I should get the comments from the hidden channel. I have not yet figured out how to do that. For completeness, the auto-generated Python3Lexer.java file contains:

public static String[] channelNames = {"DEFAULT_TOKEN_CHANNEL", "HIDDEN"};

public static String[] modeNames = {"DEFAULT_MODE"};

Solution

  • The TerminalNode node objects of the visitTerminal do not include the comments.

    That is correct: the lexer skips these tokens. You can instead put them on another channel (so they are not discarded) by replacing -> skip with -> channel(HIDDEN). That still will not make them appear in the visitTerminal(...) method, though. The parser only reads tokens from the default channel, so only tokens matched by parser rules end up there.

    For the record, when changing:

    SKIP_ : ( SPACES | COMMENT | LINE_JOINING) -> skip;
    ...
    fragment COMMENT : '#' ~[\r\n\f]*;
    

    to:

    COMMENT : '#' ~[\r\n\f]* -> channel(HIDDEN);
    SKIP_   : ( SPACES | LINE_JOINING) -> skip;
    

    in the Python3Lexer.g4 file and then regenerating the lexer/parser classes, you can see that comments are no longer discarded but placed on another channel:

    String source = "\"\"\"A file docstring.\n" +
            "With a multiline starting docstring.\n" +
            "That spans the first 3 lines.\"\"\"\n" +
            "# Some Comment.\n" +
            "\n" +
            "# Another comment\n" +
            "\"\"\"Some string.\"\"\"\n" +
            "def foo():\n" +
            "    \"\"\"Some docstring.\"\"\"\n" +
            "    print('hello world')\n" +
            "    def bar():\n" +
            "        \"\"\"Another docstring.\"\"\"\n" +
            "        print('hello world')\n" +
            "def baz():\n" +
            "        \"\"\"Third docstring.\"\"\"\n" +
            "        print('hello universe')\n";
    
    Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
    CommonTokenStream tokenStream = new CommonTokenStream(lexer);
    tokenStream.fill();
    
    
    for (Token t : tokenStream.getTokens()) {
        System.out.printf("channel=%s, text=%s%n",
                t.getChannel(), t.getText().replace("\n", "\\n"));
    }
    

    will print:

    
    channel=0, text="""A file docstring.\nWith a multiline starting docstring.\nThat spans the first 3 lines."""
    channel=1, text=# Some Comment.
    channel=1, text=# Another comment
    channel=0, text=\n
    channel=0, text="""Some string."""
    channel=0, text=\n
    channel=0, text=def
    ...
    

    But they will still not be a part of the parse tree you're walking with a listener or visitor: only tokens defined in parser rules will show up there.
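
One way to still reach the comments from a listener is to look sideways in the token stream: for each token the listener visits, ask which hidden-channel tokens sit directly before it. In the ANTLR 4 Java runtime, BufferedTokenStream (the superclass of CommonTokenStream) offers getHiddenTokensToLeft(int tokenIndex, int channel) for this. The sketch below is not that API. It is a plain-Java model with no ANTLR dependency, so it runs on its own, and every name in it (HiddenChannelSketch, Tok, visibleToParser, hiddenToLeft, sampleStream) is hypothetical. It replays the first five rows of the token dump above and is a simplification of what the runtime does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of lexer channels. None of these types are ANTLR classes;
// every name here is hypothetical and exists only for illustration.
final class HiddenChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    // Stand-in for a lexer token: its text plus the channel it was emitted on.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // What the parser consumes, and so what can end up in the parse tree.
    static List<String> visibleToParser(List<Tok> stream) {
        List<String> visible = new ArrayList<>();
        for (Tok t : stream) {
            if (t.channel == DEFAULT_CHANNEL) {
                visible.add(t.text);
            }
        }
        return visible;
    }

    // Hidden-channel tokens immediately left of stream position `index`, in
    // source order. Scanning stops at the first token on any other channel.
    static List<String> hiddenToLeft(List<Tok> stream, int index) {
        List<String> hidden = new ArrayList<>();
        for (int i = index - 1; i >= 0 && stream.get(i).channel == HIDDEN_CHANNEL; i--) {
            hidden.add(stream.get(i).text);
        }
        Collections.reverse(hidden);
        return hidden;
    }

    // The first five rows of the token dump, in the same order.
    static List<Tok> sampleStream() {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("\"\"\"A file docstring.\nWith a multiline starting docstring.\n"
                + "That spans the first 3 lines.\"\"\"", DEFAULT_CHANNEL));
        stream.add(new Tok("# Some Comment.", HIDDEN_CHANNEL));
        stream.add(new Tok("# Another comment", HIDDEN_CHANNEL));
        stream.add(new Tok("\n", DEFAULT_CHANNEL));
        stream.add(new Tok("\"\"\"Some string.\"\"\"", DEFAULT_CHANNEL));
        return stream;
    }

    public static void main(String[] args) {
        List<Tok> stream = sampleStream();
        for (int i = 0; i < stream.size(); i++) {
            Tok t = stream.get(i);
            if (t.channel != DEFAULT_CHANNEL) {
                continue; // a tree walk never reaches hidden tokens directly
            }
            System.out.printf("token=%s, comments before it=%s%n",
                    t.text.replace("\n", "\\n"), hiddenToLeft(stream, i));
        }
    }
}
```

In the real listener, the equivalent lookup would presumably go inside visitTerminal: pass the CommonTokenStream to the listener's constructor and call tokenStream.getHiddenTokensToLeft(node.getSymbol().getTokenIndex(), Token.HIDDEN_CHANNEL). That method returns null rather than an empty list when nothing is found, so the result needs a null check. Comments after the last visited token would need the matching getHiddenTokensToRight call.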