ANTLR4 "rule ... contains a closure with at least one alternative that can match an empty string"


I'm translating the ABNF grammar defined in RFC 5322 to ANTLR4, as practice to brush up on my ANTLR knowledge.

My trouble is with two rules, defined in the RFC,

obs-body        =   *((*LF *CR *((%d0 / text) *LF *CR)) / CRLF)
obs-unstruct    =   *((*LF *CR *(obs-utext *LF *CR)) / FWS)

and in my ANTLR translation,

rfc5322_obsBody
  : (LFD* CR* ((NUL | rfc5322_text) LFD* CR*)* | CRLF)*
  ;

rfc5322_obsUnstruct
  : (LFD* CR* (rfc5322_obsUText LFD* CR*)* | rfc5322_fws)*
  ;

ANTLR reports the following errors:

error(153): InternetMessage.g4:486:0: rule rfc5322_obsBody contains a closure with at least one alternative that can match an empty string
error(153): InternetMessage.g4:490:0: rule rfc5322_obsUnstruct contains a closure with at least one alternative that can match an empty string

Why? How can I fix these errors?

For reference, here are the trimmed sources.

Common.g4:

grammar Common;

// Parser
//--------------------------------------------------

alpha
  : LA | UA
  | LB | UB
  | LC | UC
  | LD | UD
  | LE | UE
  | LF | UF
  | LG | UG
  | LH | UH
  | LI | UI
  | LJ | UJ
  | LK | UK
  | LL | UL
  | LM | UM
  | LN | UN
  | LO | UO
  | LP | UP
  | LQ | UQ
  | LR | UR
  | LS | US
  | LT | UT
  | LU | UU
  | LV | UV
  | LW | UW
  | LX | UX
  | LY | UY
  | LZ | UZ
  ;

digit
  : D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
  ;

hexdig
  : LA | UA
  | LB | UB
  | LC | UC
  | LD | UD
  | LE | UE
  | LF | UF
  | digit
  ;

vchar
  : BANG
  | DQUOTE
  | HASH
  | DOLLAR
  | PERCENT
  | AND
  | SQUOTE
  | LPAREN
  | RPAREN
  | STAR
  | PLUS
  | COMMA
  | MINUS
  | DOT
  | SLASH
  | digit
  | COLON
  | SEMICOLON
  | LANGLE
  | EQUAL
  | RANGLE
  | QUESTION
  | AT
  | alpha
  | LSQUARE
  | BSLASH
  | RSQUARE
  | CARROT
  | UNDERSCORE
  | BTICK
  | LCURLY
  | BAR
  | RCURLY
  | TILDE
  ;

alphanum
  : alpha
  | digit
  ;

wsp
  : SP
  | HT
  ;

// Lexer
//--------------------------------------------------

// ASCII
//

NUL         : '\u0000' ;
SOH         : '\u0001' ;
STX         : '\u0002' ;
ETX         : '\u0003' ;
EOT         : '\u0004' ;
ENQ         : '\u0005' ;
ACK         : '\u0006' ;
BEL         : '\u0007' ;
BSP         : '\u0008' ;
HT          : '\t'     ;
LFD         : '\n'     ;
VT          : '\u000B' ;
FF          : '\u000C' ;
CR          : '\r'     ;
SO          : '\u000E' ;
SI          : '\u000F' ;
DLE         : '\u0010' ;
DC1         : '\u0011' ;
DC2         : '\u0012' ;
DC3         : '\u0013' ;
DC4         : '\u0014' ;
NAK         : '\u0015' ;
SYN         : '\u0016' ;
ETB         : '\u0017' ;
CAN         : '\u0018' ;
EM          : '\u0019' ;
SUB         : '\u001A' ;
ESC         : '\u001B' ;
FS          : '\u001C' ;
GS          : '\u001D' ;
RS          : '\u001E' ;
USR         : '\u001F' ;
SP          : ' '      ;
BANG        : '!'      ;
DQUOTE      : '"'      ;
HASH        : '#'      ;
DOLLAR      : '$'      ;
PERCENT     : '%'      ;
AND         : '&'      ;
SQUOTE      : '\''     ;
LPAREN      : '('      ;
RPAREN      : ')'      ;
STAR        : '*'      ;
PLUS        : '+'      ;
COMMA       : ','      ;
MINUS       : '-'      ;
DOT         : '.'      ;
SLASH       : '/'      ;
D0          : '0'      ;
D1          : '1'      ;
D2          : '2'      ;
D3          : '3'      ;
D4          : '4'      ;
D5          : '5'      ;
D6          : '6'      ;
D7          : '7'      ;
D8          : '8'      ;
D9          : '9'      ;
COLON       : ':'      ;
SEMICOLON   : ';'      ;
LANGLE      : '<'      ;
EQUAL       : '='      ;
RANGLE      : '>'      ;
QUESTION    : '?'      ;
AT          : '@'      ;
UA          : 'A'      ;
UB          : 'B'      ;
UC          : 'C'      ;
UD          : 'D'      ;
UE          : 'E'      ;
UF          : 'F'      ;
UG          : 'G'      ;
UH          : 'H'      ;
UI          : 'I'      ;
UJ          : 'J'      ;
UK          : 'K'      ;
UL          : 'L'      ;
UM          : 'M'      ;
UN          : 'N'      ;
UO          : 'O'      ;
UP          : 'P'      ;
UQ          : 'Q'      ;
UR          : 'R'      ;
US          : 'S'      ;
UT          : 'T'      ;
UU          : 'U'      ;
UV          : 'V'      ;
UW          : 'W'      ;
UX          : 'X'      ;
UY          : 'Y'      ;
UZ          : 'Z'      ;
LSQUARE     : '['      ;
BSLASH      : '\\'     ;
RSQUARE     : ']'      ;
CARROT      : '^'      ;
UNDERSCORE  : '_'      ;
BTICK       : '`'      ;
LA          : 'a'      ;
LB          : 'b'      ;
LC          : 'c'      ;
LD          : 'd'      ;
LE          : 'e'      ;
LF          : 'f'      ;
LG          : 'g'      ;
LH          : 'h'      ;
LI          : 'i'      ;
LJ          : 'j'      ;
LK          : 'k'      ;
LL          : 'l'      ;
LM          : 'm'      ;
LN          : 'n'      ;
LO          : 'o'      ;
LP          : 'p'      ;
LQ          : 'q'      ;
LR          : 'r'      ;
LS          : 's'      ;
LT          : 't'      ;
LU          : 'u'      ;
LV          : 'v'      ;
LW          : 'w'      ;
LX          : 'x'      ;
LY          : 'y'      ;
LZ          : 'z'      ;
LCURLY      : '{'      ;
BAR         : '|'      ;
RCURLY      : '}'      ;
TILDE       : '~'      ;
DEL         : '\u007F' ;

// Miscellaneous
//

fragment CRLF  : CR LFD ;

InternetMessage.g4:

/**
 * Internet Message (RFC 5322).
 *
 * @author Oliver Yasuna
 * @see <a href="https://www.rfc-editor.org/rfc/rfc5322.html">RFC 5322</a>
 * @since 1.0.0
 */

grammar InternetMessage;

import Common;

// Parser
//--------------------------------------------------

rfc5322_obsBody
  : (LFD* CR* ((NUL | rfc5322_text) LFD* CR*)* | CRLF)*
  ;

rfc5322_obsUnstruct
  : (LFD* CR* (rfc5322_obsUText LFD* CR*)* | rfc5322_fws)*
  ;

rfc5322_text
  : SOH | STX | ETX | EOT | ENQ | ACK | BEL | BSP | HT
  | VT
  | FF
  | SO | SI | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | USR | SP | BANG | DQUOTE | HASH | DOLLAR | PERCENT | AND | SQUOTE | LPAREN | RPAREN | STAR | PLUS | COMMA | MINUS | DOT | SLASH | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | COLON | SEMICOLON | LANGLE | EQUAL | RANGLE | QUESTION | AT | UA | UB | UC | UD | UE | UF | UG | UH | UI | UJ | UK | UL | UM | UN | UO | UP | UQ | UR | US | UT | UU | UV | UW | UX | UY | UZ | LSQUARE | BSLASH | RSQUARE | CARROT | UNDERSCORE | BTICK | LA | LB | LC | LD | LE | LF | LG | LH | LI | LJ | LK | LL | LM | LN | LO | LP | LQ | LR | LS | LT | LU | LV | LW | LX | LY | LZ | LCURLY | BAR | RCURLY | TILDE | DEL
  ;

rfc5322_obsUText
  : NUL
  | rfc5322_obsNoWsCtl
  | vchar
  ;

rfc5322_obsNoWsCtl
  : SOH | STX | ETX | EOT | ENQ | ACK | BEL | BSP
  | VT
  | FF
  | SO | SI | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | USR
  | DEL
  ;

rfc5322_fws
  : (wsp* CRLF)? wsp+
  | rfc5322_obsFws
  ;

rfc5322_obsFws
  : wsp+ (CRLF wsp+)*
  ;

// Lexer
//--------------------------------------------------

RFC5322_MON  : UM LO LN ;  // 'Mon'.
RFC5322_TUE  : UT LU LE ;  // 'Tue'.
RFC5322_WED  : UW LE LD ;  // 'Wed'.
RFC5322_THU  : UT LH LU ;  // 'Thu'.
RFC5322_FRI  : UF LR LI ;  // 'Fri'.
RFC5322_SAT  : US LA LT ;  // 'Sat'.
RFC5322_SUN  : US LU LN ;  // 'Sun'.

RFC5322_JAN  : UJ LA LN ;  // 'Jan'.
RFC5322_FEB  : UF LE LB ;  // 'Feb'.
RFC5322_MAR  : UM LA LR ;  // 'Mar'.
RFC5322_APR  : UA LP LR ;  // 'Apr'.
RFC5322_MAY  : UM LA LY ;  // 'May'.
RFC5322_JUN  : UJ LU LN ;  // 'Jun'.
RFC5322_JUL  : UJ LU LL ;  // 'Jul'.
RFC5322_AUG  : UA LU LG ;  // 'Aug'.
RFC5322_SEP  : US LE LP ;  // 'Sep'.
RFC5322_OCT  : UO LC LT ;  // 'Oct'.
RFC5322_NOV  : UN LO LV ;  // 'Nov'.
RFC5322_DEC  : UD LE LC ;  // 'Dec'.

RFC5322_DATE               : UD LA LT LE                                              ;  // 'Date'.
RFC5322_FROM               : UF LR LO LM                                              ;  // 'From'.
RFC5322_SENDER             : US LE LN LD LE LR                                        ;  // 'Sender'.
RFC5322_REPLY_TO           : UR LE LP LL LY MINUS UT LO                               ;  // 'Reply-To'.
RFC5322_TO                 : UT LO                                                    ;  // 'To'.
RFC5322_CC                 : UC LC                                                    ;  // 'Cc'.
RFC5322_BCC                : UB LC LC                                                 ;  // 'Bcc'.
RFC5322_MESSAGE_ID         : UM LE LS LS LA LG LE MINUS UI UD                         ;  // 'Message-ID'.
RFC5322_IN_REPLY_TO        : UI LN MINUS UR LE LP LL LY MINUS UT LO                   ;  // 'In-Reply-To'.
RFC5322_REFERENCES         : UR LE LF LE LR LE LN LC LE LS                            ;  // 'References'.
RFC5322_SUBJECT            : US LU LB LJ LE LC LT                                     ;  // 'Subject'.
RFC5322_COMMENTS           : UC LO LM LM LE LN LT LS                                  ;  // 'Comments'.
RFC5322_KEYWORDS           : UK LE LY LW LO LR LD LS                                  ;  // 'Keywords'.
RFC5322_RESENT_DATE        : UR LE LS LE LN LT MINUS UD LA LT LE                      ;  // 'Resent-Date'.
RFC5322_RESENT_FROM        : UR LE LS LE LN LT MINUS UF LR LO LM                      ;  // 'Resent-From'.
RFC5322_RESENT_SENDER      : UR LE LS LE LN LT MINUS US LE LN LD LE LR                ;  // 'Resent-Sender'.
RFC5322_RESENT_TO          : UR LE LS LE LN LT MINUS UT LO                            ;  // 'Resent-To'.
RFC5322_RESENT_CC          : UR LE LS LE LN LT MINUS UC LC                            ;  // 'Resent-Cc'.
RFC5322_RESENT_BCC         : UR LE LS LE LN LT MINUS UB LC LC                         ;  // 'Resent-Bcc'.
RFC5322_RESENT_MESSAGE_ID  : UR LE LS LE LN LT MINUS UM LE LS LS LA LG LE MINUS UI UD ;  // 'Resent-Message-ID'.
RFC5322_RESENT_REPLY_TO    : UR LE LS LE LN LT MINUS UR LE LP LL LY MINUS UT LO       ;  // 'Resent-Reply-To'.
RFC5322_RETURN_PATH        : UR LE LT LU LR LN MINUS UP LA LT LH                      ;  // 'Return-Path'.
RFC5322_RECEIVED           : UR LE LC LE LI LV LE LD                                  ;  // 'Received'.

RFC5322_DATE_C               : RFC5322_DATE COLON              ;  // 'Date:'.
RFC5322_FROM_C               : RFC5322_FROM COLON              ;  // 'From:'.
RFC5322_SENDER_C             : RFC5322_SENDER COLON            ;  // 'Sender:'.
RFC5322_REPLY_TO_C           : RFC5322_REPLY_TO COLON          ;  // 'Reply-To:'.
RFC5322_TO_C                 : RFC5322_TO COLON                ;  // 'To:'.
RFC5322_CC_C                 : RFC5322_CC COLON                ;  // 'Cc:'.
RFC5322_BCC_C                : RFC5322_BCC COLON               ;  // 'Bcc:'.
RFC5322_MESSAGE_ID_C         : RFC5322_MESSAGE_ID COLON        ;  // 'Message-ID:'.
RFC5322_IN_REPLY_TO_C        : RFC5322_IN_REPLY_TO COLON       ;  // 'In-Reply-To:'.
RFC5322_REFERENCES_C         : RFC5322_REFERENCES COLON        ;  // 'References:'.
RFC5322_SUBJECT_C            : RFC5322_SUBJECT COLON           ;  // 'Subject:'.
RFC5322_COMMENTS_C           : RFC5322_COMMENTS COLON          ;  // 'Comments:'.
RFC5322_KEYWORDS_C           : RFC5322_KEYWORDS COLON          ;  // 'Keywords:'.
RFC5322_RESENT_DATE_C        : RFC5322_RESENT_DATE COLON       ;  // 'Resent-Date:'.
RFC5322_RESENT_FROM_C        : RFC5322_RESENT_FROM COLON       ;  // 'Resent-From:'.
RFC5322_RESENT_SENDER_C      : RFC5322_RESENT_SENDER COLON     ;  // 'Resent-Sender:'.
RFC5322_RESENT_TO_C          : RFC5322_RESENT_TO COLON         ;  // 'Resent-To:'.
RFC5322_RESENT_CC_C          : RFC5322_RESENT_CC COLON         ;  // 'Resent-Cc:'.
RFC5322_RESENT_BCC_C         : RFC5322_RESENT_BCC COLON        ;  // 'Resent-Bcc:'.
RFC5322_RESENT_MESSAGE_ID_C  : RFC5322_RESENT_MESSAGE_ID COLON ;  // 'Resent-Message-ID:'.
RFC5322_RETURN_PATH_C        : RFC5322_RETURN_PATH COLON       ;  // 'Return-Path:'.
RFC5322_RECEIVED_C           : RFC5322_RECEIVED COLON          ;  // 'Received:'.

RFC5322_UT   : UU UT    ;  // 'UT'.
RFC5322_GMT  : UG UM UT ;  // 'GMT'.
RFC5322_EST  : UE US UT ;  // 'EST'.
RFC5322_EDT  : UE UD UT ;  // 'EDT'.
RFC5322_CST  : UC US UT ;  // 'CST'.
RFC5322_CDT  : UC UD UT ;  // 'CDT'.
RFC5322_MST  : UM US UT ;  // 'MST'.
RFC5322_MDT  : UM UD UT ;  // 'MDT'.
RFC5322_PST  : UP US UT ;  // 'PST'.
RFC5322_PDT  : UP UD UT ;  // 'PDT'.

Solution

  • In both rules, one alternative can match the empty string, and that alternative is then repeated. ANTLR rejects a closure over something that can match nothing, because the loop could iterate without ever consuming input.

    Reformatting the rules makes it apparent which alternative can match nothing:

    rfc5322_obsBody
      : ( LFD* CR* ((NUL | rfc5322_text) LFD* CR*)* // alternative 1
        | CRLF                                      // alternative 2
        )*
      ;
    
    rfc5322_obsUnstruct
      : ( LFD* CR* (rfc5322_obsUText LFD* CR*)* // alternative 1
        | rfc5322_fws                           // alternative 2
        )*
      ;
    

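    In both cases, every element of alternative 1 is optional (LFD*, CR*, and the inner (...)*), so alternative 1 can match the empty string, and it is then repeated by the outer *.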

    In both cases, you should be able to fix this by changing the inner * (zero or more) into + (one or more):

    rfc5322_obsBody
      : ( LFD* CR* ((NUL | rfc5322_text) LFD* CR*)+ // alternative 1
        | CRLF                                      // alternative 2
        )* // <- outer *
      ;
    
    rfc5322_obsUnstruct
      : ( LFD* CR* (rfc5322_obsUText LFD* CR*)+ // alternative 1
        | rfc5322_fws                           // alternative 2
        )* // <- outer *
      ;
    

    This forces alternative 1 to consume at least one token, while the outer * still allows the entire rule to match the empty string, so you're OK there.
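
    One side effect of the + version is that it is slightly stricter than the ABNF: a run made up only of LFD or CR tokens, with no text token in it, is no longer matched by alternative 1. If that matters, one possible alternative is to flatten the closure. The sketch below assumes the token and rule names from the question and that CRLF reaches the parser as a token. Alternative 1 can by itself match a single LFD, a single CR, or a single text token, so repeating it accepts any sequence of those, and the rules should then be equivalent to:

    ```antlr
    // Sketch only: flattened form of the two closures.
    // Every alternative inside ( ... )* consumes exactly one token or
    // one non-empty sub-rule, so none of them can match the empty string.
    rfc5322_obsBody
      : ( LFD
        | CR
        | NUL
        | rfc5322_text
        | CRLF
        )*
      ;

    rfc5322_obsUnstruct
      : ( LFD
        | CR
        | rfc5322_obsUText
        | rfc5322_fws   // non-empty: both of its alternatives require wsp+
        )*
      ;
    ```

    Since no alternative inside the closure is nullable, error 153 should not be reported for these rules, and each rule as a whole can still match empty input through the outer *.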