Tags: antlr4, tokenize, lexical-analysis

antlr 4 lexer rule RULE: '<TAG>'; isn't recognized as token but if fragment rule then recognized


EDIT: I've been asked if I can provide the full grammar. I cannot, and here is why:

I cannot provide my full grammar code because it is homework and I am not allowed to disclose my solution, so I will understand if my question cannot be answered because of this. I am just hoping this is a simple thing that I am failing to understand from the documentation, and that this will be enough for someone who knows ANTLR4 to know the answer.

This note was originally further down in the post, but to avoid frustrating potential helpers I have promoted it to the top. Disclaimer: this is homework related.

I am trying to tokenize a piece of text for homework, and almost everything works as expected, except the following:

TIME                    : '<time>';

This rule used to be in my grammar. When tokenizing the piece of text, I would not see a TIME token; instead I would see a '<time>' token (which I guess ANTLR created for me somehow). But when I moved the string itself into a fragment rule and made the TIME rule point to it, like so:

fragment TIME_TAG       : '<time>';
...
TIME                    : TIME_TAG;

Then I see the TIME token as expected. I've been searching the internet for several hours and couldn't find an answer.

Another thing that happens involves the ATHLETE rule, which is defined as:

ATHLETE                 : WHITESPACE* '<athlete>' WHITESPACE*;

is also recognized properly and I see the ATHLETE token, but it was not recognized when I did not allow the WHITESPACE* before and after the tag string.


Here is my piece of text:

World Record World Record
[1] <time> 9.86 <athlete> "Carl Lewis" <country> "United
States" <date> 25 August 1991
[2] <time> 9.69 <athlete> "Tyson Gay" <country> "United
States" <date> 20 September 2009
[3] <time> 9.82 <athlete> "Donovan Baily" <country>
"Canada" <date> 27 July 1996
[4] <time> 9.58
 <athlete> "Usain Bolt"
 <country> "Jamaica" <date> 16 August 2009

[5] <time> 9.79 <athlete> "Maurice Greene" <country>
"United State" <date> 16 June 1999

My task is simply to tokenize it. I am not given the definitions of the tokens; I am supposed to decide them myself. I think '<sometag>' is a pretty obvious token, as are '"'-wrapped strings, numbers, dates, and square-bracketed enumerations.

Thanks in advance for any help or useful knowledge.


Solution

  • (This will be something of a challenge without just doing your homework for you, but maybe a few comments will set you on your way.)

    The TIME : '<time>'; rule should work just fine. ANTLR only creates implicit tokens for literals that appear in parser rules (parser rule names begin with a lowercase letter, lexer rule names with an uppercase letter), so that wouldn't have happened with this exact example. Perhaps you had a rule name that began with a lowercase letter?

    Note: If you dump your tokens, you'll see the TIME token represented like so:

    [@3,5:10='<time>',<'<time>'>,2:4]
    

    This means that ANTLR has recognized it as the TIME token. (I suspect this may be the source of the confusion; the <'<time>'> part is just how ANTLR prints the type of a token whose lexer rule is a single literal.)
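    To illustrate, here is a minimal lexer-only grammar of my own (the grammar name and the extra NUMBER rule are invented for the example; this is not the asker's grammar) that should emit a TIME token for '<time>':

    ```antlr
    // Tags.g4 -- hypothetical minimal lexer grammar
    lexer grammar Tags;

    TIME   : '<time>';               // lexer rule (uppercase): emits a TIME token
    NUMBER : [0-9]+ ('.' [0-9]+)?;   // e.g. 9.86
    WS     : [ \t\r\n]+ -> skip;     // drop whitespace entirely
    ```

    Dumping the tokens for an input like `<time> 9.86` (for example with ANTLR's TestRig: `grun Tags tokens -tokens`) should show a TIME token followed by a NUMBER token, with the TIME token's type displayed as <'<time>'> as described above.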

    As @kaby76 mentions, we usually skip whitespace or send it to a hidden channel, since we don't want to have to be explicit in parser rules about everywhere we allow whitespace. Either of those options causes whitespace to be ignored by the parser. A very common whitespace rule is:

    WS: [ \t\r\n]+;
    

    Since you're only tokenizing, you won't need to worry about parser rules.

    Adding this Lexer rule will tokenize whitespace into separate tokens for you so you don't need to account for it in rules like ATHLETE.
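    As a sketch of how that simplifies the ATHLETE rule (assuming, as above, that whitespace gets its own rule; this is an illustration, not the full homework grammar):

    ```antlr
    // Hypothetical fragment: with whitespace handled separately,
    // ATHLETE no longer needs to absorb surrounding spaces
    ATHLETE : '<athlete>';
    WS      : [ \t\r\n]+;   // whitespace becomes its own token (or add -> skip to drop it)
    ```

    With this in place, an input like ` <athlete> ` tokenizes as WS, ATHLETE, WS, rather than as one ATHLETE token that has swallowed the surrounding spaces.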

    You'll need to work out lexer rules for the rest of your content, but perhaps this will help you move forward.