Search code examples
compiler-errorscompiler-constructionantlrantlr4

ANTLR4.7: rule XXX contains a closure with at least one alternative that can match an empty string'


I am trying to create a grammar to match content like below:

(For a simple grammar to repro this issue please see ADD 1)

[Defines]
  INF_VERSION                    = 0x00010005
  BASE_NAME                      = WebServer
  FILE_GUID                      = 99E87DCF-6162-40c5-9FA1-32111F5197F7
  MODULE_TYPE                    = SEC
  UEFI_SPECIFICATION_VERSION     = 0x00010005

The UEFI_SPECIFICATION_VERSION = 0x00010005 part is optional.

(for brevity, I omitted some of the grammar).

My grammar 1 looks like this:

defines : '[Defines]'
         define_statement+
         ;

define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                  | 'BASE_NAME' EQ BaseName
                  | 'FILE_GUID' EQ RegistryFormatGUID
                  | 'MODULE_TYPE' EQ Edk2ModuleType
                  | ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal)?
                  ;

ANTLR 4.7 reports this error:

message: 'rule defines contains a closure with at least one alternative that can match an empty string'

But if I changed grammar like this:

defines : '[Defines]'
         define_statement+
         | ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal)? // <<< HERE
         ;

define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal
                  | 'BASE_NAME' EQ BaseName
                  | 'FILE_GUID' EQ RegistryFormatGUID
                  | 'MODULE_TYPE' EQ Edk2ModuleType

The error is gone.

My question is, what does the closure mean? Which part is the closure? The define_statement?

After I move the potentially empty alternative, the defines rule can alternate between '[Defines]' define_statement+ and ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal)?, which means defines can still match empty string. How could the error be gone?

ADD 1

To make things more clear, I repro this error with a simplified grammar:

grammar test;

rule : alternate+; // <<<<< HERE
alternate : '1'?;

If I use + or * at HERE, ANTLR will report an error:

'rule rule contains a closure with at least one alternative that can match an empty string'

If I use ? at HERE, ANTLR will report a warning:

'rule rule contains an optional block with at least one alternative that can match an empty string'

I am still not sure why.

ADD 2

Each of the alternate WILL be a child node of rule, so if alternate can be empty string, then it is logically possible to lead to endless child nodes for rule. So I guess this may explain why ANTLR forbids me to do that with alternate+ or alternate*. But if it is with alternate?, at most there will be one child node. It's only a performance issue. So ANTLR just generate a warning.


Solution

  • Let's start with the warning. The application is merely alerting you that something can be matched by the empty string. This is a warning because most of the time, you don't want tokens to match to the empty string.

    defines : '[Defines]'
             define_statement+
             ;
    
    define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                      | 'BASE_NAME' EQ BaseName
                      | 'FILE_GUID' EQ RegistryFormatGUID
                      | 'MODULE_TYPE' EQ Edk2ModuleType
                      | ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal)?
                      ;
    

    Since ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal) is optional (it is followed by ?, it could be replace by nothing, like this:

    define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                      | 'BASE_NAME' EQ BaseName
                      | 'FILE_GUID' EQ RegistryFormatGUID
                      | 'MODULE_TYPE' EQ Edk2ModuleType
                      | 
                      ;
    

    That last | by itself means the rule can match nothing, or the empty string. So the mystery about the warning is solved. They call it a closure, but you could think of it as a "token binding" or "match." I don't think the terminology is all that important in a practical sense.

    The error goes away if you remove the alternative, because then, again rewriting for clarity, we have:

    define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                      | 'BASE_NAME' EQ BaseName
                      | 'FILE_GUID' EQ RegistryFormatGUID
                      | 'MODULE_TYPE' EQ Edk2ModuleType
                      ;
    

    And there's nothing optional there. One of those has to match.

    You've already mentioned in your comments that you understand why moving the rule to its own rule -- that can potentially match an infinite number of empty strings -- is a bad, idea, so I won't belabor that here.

    But why did the error go away when you did that? Because

    defines : '[Defines]'
             define_statement+
             | ('UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal)? // <<< HERE
             ;
    

    is guaranteed to match something, even if it's only the token [Defines] , which is an implicit lexer token. So even if the UEFI thing is empty string, there's still something to parse. That wasn't true in the first version we examined; indeed the whole define_statement rule there could have been an empty string. That's quite a difference from a parsing standpoint.

    Now the big question: Is the [Defines] section truly optional, or not? Only you can answer that. But if it is, perhaps you should just recode it as:

    defines : ('[Defines]' define_statement+)?
    
    define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                      | 'BASE_NAME' EQ BaseName
                      | 'FILE_GUID' EQ RegistryFormatGUID
                      | 'MODULE_TYPE' EQ Edk2ModuleType
                      | 'UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal
    

    This makes it completely optional. Again, only you can decide if this is valid for your grammar and expected input.

    Make sense? I hope I helped you!

    EDIT 1

    To relieve the error, try this grammar (I made explicit tokens for the test values to get it to run):

    grammar Uefi;
    defines : '[Defines]' statement+ ;
    statement : define_statement | uefi_statement ;      
    uefi_statement : 'UEFI_SPECIFICATION_VERSION' EQ SpecVersion_VersionVal ;
    define_statement  : 'INF_VERSION' EQ SpecVersion_VersionVal 
                      | 'BASE_NAME' EQ BaseName
                      | 'FILE_GUID' EQ RegistryFormatGUID
                      | 'MODULE_TYPE' EQ Edk2ModuleType
                      ;
    // DUMMY VALUES               
    SpecVersion_VersionVal : '0x00010005';
    BaseName : 'WebServer';
    RegistryFormatGUID : '99E87DCF-6162-40c5-9FA1-32111F5197F7';
    Edk2ModuleType : 'SEC';
    EQ : '=';
    WS : [ \t\r\n]+ -> skip;
    

    enter image description here