java python-3.x comments antlr4 concrete-syntax-tree

How to include Python comments in Java ANTLR4 visitTerminal overridden function?

Context

I auto-generated the following files in accordance with this answer:

Python3Lexer.java
Python3ParserBase.java
Python3ParserListener.java
PythonDocstringModifierListener.java
Python3Parser.java

Then I modified the MWE in that question to include:

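import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.tree.TerminalNode;

public class SomePythonListener extends Python3ParserBaseListener {
  private final Python3Parser parser;
  private final String someValue;

  public SomePythonListener(Python3Parser parser, String someValue) {
    this.parser = parser;
    this.someValue = someValue;
  }

  // Called once for every token that a parser rule matched.
  @Override
  public void visitTerminal(TerminalNode node) {
    Token token = node.getSymbol();
    System.out.println("token.getType()=" + token.getType());
    System.out.println("getText:" + token.getText() + "END\n");
  }
}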

And I feed it the source code:

"""A file docstring.
With a multiline starting docstring.
That spans the first 3 lines."""
# Some Comment.

# Another comment
"""Some string."""
def foo():
    """Some docstring."""
    print('hello world')
    def bar():
        """Another docstring."""
        print('hello world')
def baz():
        """Third docstring."""
        print('hello universe')

This then outputs:

token.getType()=3
getText:"""A file docstring.
With a multiline starting docstring.
That spans the first 3 lines."""END

token.getType()=44
getText:
END

token.getType()=3
getText:"""Some string."""END

token.getType()=44
getText:
END

token.getType()=15
getText:defEND

token.getType()=45
getText:fooEND

token.getType()=57
getText:(END

token.getType()=58
getText:)END

token.getType()=60
getText::END

token.getType()=44
getText: END

token.getType()=1
getText:    END

token.getType()=3

For completeness, token type 44 is the newline token. The first docstring is visited, followed by a newline, followed by the second string """Some string.""". However, both comments (# Some Comment. and # Another comment) are never visited.

Issue

The TerminalNode node objects of the visitTerminal do not include the comments.

Question

How can I include the comments in the visitor?

Attempt

Based on these answers, it seems I should get the comments from the hidden channel, but I have not yet figured out how to do that. For completeness, the auto-generated Python3Lexer.java file contains:

public static String[] channelNames = {"DEFAULT_TOKEN_CHANNEL", "HIDDEN"};

public static String[] modeNames = {"DEFAULT_MODE"};

Solution

The TerminalNode node objects of the visitTerminal do not include the comments.

That is correct: the lexer skips these tokens, so the parser never sees them. You can keep them by putting them on another channel instead, replacing -> skip with -> channel(HIDDEN). But that will still not make them appear in the visitTerminal(...) method. After all, only tokens that are matched by parser rules appear there.

For the record, when changing:

SKIP_ : ( SPACES | COMMENT | LINE_JOINING) -> skip;
...
fragment COMMENT : '#' ~[\r\n\f]*;

to:

COMMENT : '#' ~[\r\n\f]* -> channel(HIDDEN);
SKIP_   : ( SPACES | LINE_JOINING) -> skip;

in the Python3Lexer.g4 file and then regenerating the lexer/parser classes, you can see that comments are no longer discarded but are placed on another channel:

String source = "\"\"\"A file docstring.\n" +
        "With a multiline starting docstring.\n" +
        "That spans the first 3 lines.\"\"\"\n" +
        "# Some Comment.\n" +
        "\n" +
        "# Another comment\n" +
        "\"\"\"Some string.\"\"\"\n" +
        "def foo():\n" +
        "    \"\"\"Some docstring.\"\"\"\n" +
        "    print('hello world')\n" +
        "    def bar():\n" +
        "        \"\"\"Another docstring.\"\"\"\n" +
        "        print('hello world')\n" +
        "def baz():\n" +
        "        \"\"\"Third docstring.\"\"\"\n" +
        "        print('hello universe')\n";

Python3Lexer lexer = new Python3Lexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
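tokenStream.fill(); // force the lexer to tokenize the whole input

// Print every token, including those on the hidden channel.
for (Token t : tokenStream.getTokens()) {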
    System.out.printf("channel=%s, text=%s%n",
            t.getChannel(), t.getText().replace("\n", "\\n"));
}

will print:

channel=0, text="""A file docstring.\nWith a multiline starting docstring.\nThat spans the first 3 lines."""
channel=1, text=# Some Comment.
channel=1, text=# Another comment
channel=0, text=\n
channel=0, text="""Some string."""
channel=0, text=\n
channel=0, text=def
...

But they will still not be a part of the parse tree you're walking with a listener or visitor: only tokens defined in parser rules will show up there.
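
Because the comments stay in the token stream, a listener can still look them up from the token it is currently visiting. In the ANTLR runtime this is what BufferedTokenStream.getHiddenTokensToLeft(int tokenIndex, int channel) is for: given node.getSymbol().getTokenIndex() and Token.HIDDEN_CHANNEL, it collects the hidden tokens directly before that token (as far as I know it returns null rather than an empty list when there are none, so check before iterating). The listener needs a reference to the CommonTokenStream for this, for example passed in through its constructor. The sketch below models only that lookup in plain Java so it runs without the ANTLR jar or the generated classes. Tok, sampleTokens and hiddenToLeft are hypothetical names made up for this illustration, not ANTLR API, and the token list is an abbreviated copy of the output shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the hidden-channel lookup; nothing here is ANTLR API.
class HiddenTokenDemo {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    /** Minimal stand-in for an ANTLR Token: a channel and the matched text. */
    static final class Tok {
        final int channel;
        final String text;

        Tok(int channel, String text) {
            this.channel = channel;
            this.text = text;
        }
    }

    /** Abbreviated version of the token list printed above (assumed order). */
    static List<Tok> sampleTokens() {
        return Arrays.asList(
                new Tok(DEFAULT_CHANNEL, "\"\"\"A file docstring.\"\"\""),
                new Tok(HIDDEN_CHANNEL, "# Some Comment."),
                new Tok(HIDDEN_CHANNEL, "# Another comment"),
                new Tok(DEFAULT_CHANNEL, "\n"),
                new Tok(DEFAULT_CHANNEL, "\"\"\"Some string.\"\"\""));
    }

    /**
     * Walks left from the token at index and collects the text of tokens on
     * the requested channel, stopping at the first default-channel token.
     * The result is returned in source order.
     */
    static List<String> hiddenToLeft(List<Tok> tokens, int index, int channel) {
        List<String> found = new ArrayList<>();
        for (int i = index - 1; i >= 0; i--) {
            Tok t = tokens.get(i);
            if (t.channel == DEFAULT_CHANNEL) {
                break; // reached the previous token the parser can see
            }
            if (t.channel == channel) {
                found.add(t.text);
            }
        }
        Collections.reverse(found); // collected right-to-left, so flip it
        return found;
    }

    public static void main(String[] args) {
        List<Tok> tokens = sampleTokens();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.channel != DEFAULT_CHANNEL) {
                continue; // a listener is only called for default-channel tokens
            }
            for (String comment : hiddenToLeft(tokens, i, HIDDEN_CHANNEL)) {
                System.out.println("comment: " + comment);
            }
            System.out.println("token:   " + t.text.replace("\n", "\\n"));
        }
    }
}
```

In the real listener, the body of main's loop would move into visitTerminal(...), with the index taken from the visited token instead of a loop counter. Note that with the token order shown above, the comments end up attached to the newline token that follows them, not to the next statement.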