parsing built-in bnfc

How to disable built-in rules?


How can I disable all of BNFC's built-in rules, such as Ident and Integer, as well as the use of spaces to separate tokens?

I find them useless and annoying, since they interfere with the parsers I'm trying to write.

I already tried to redefine them, but the lexer still seems to generate the rules for them. I could delete them manually from the generated files, but I'm completely against modifying machine-generated code.


Long version of why they are annoying.

I'm just starting to learn how to use BNFC. The first thing I tried was to port an earlier project of mine from Alex to BNFC. In particular, I want to match only "good" Roman numerals. I thought it would be quite simple: a Roman numeral can be seen as a sequence like

<thousand-part> <hundred-part> <tens-part> <unit-part>

where the parts cannot all be empty. So a numeral either has a non-empty thousand-part and anything at all in the rest, or it has an empty thousand-part, in which case at least one of the hundred-, tens- or unit-parts must be non-empty. The same reasoning is repeated at each level, down to the base case of the units.

So I came up with this, which is more or less a direct translation of what I did in Alex:

N1.            Numeral ::= TokThousands HundredNumber     ;
N2.            Numeral ::= HundredNumberNE                ; --NE = Not Empty
N3.      HundredNumber ::=                                ;
N4.      HundredNumber ::= HundredNumberNE                ;
N5.    HundredNumberNE ::= TokHundreds TensNumber         ;
N6.    HundredNumberNE ::= TensNumberNE                   ;
N7.         TensNumber ::=                                ;
N8.         TensNumber ::= TensNumberNE                   ;
N9.       TensNumberNE ::= TokTens UnitNumber             ;
N10.      TensNumberNE ::= UnitNumberNE                   ;
N11.        UnitNumber ::=                                ;
N12.        UnitNumber ::= UnitNumberNE                   ;
N13.      UnitNumberNE ::= TokUnits                       ;


token TokThousands ({"MMM"} | {"MM"} | {"M"}) ;  -- No x{m,n} in BNFC regexes?
token TokHundreds  ({"CM"} | {"DCCC"} | {"DCC"} | {"DC"} | {"D"} | {"CD"} | {"CCC"} | {"CC"} | {"C"}) ;
token TokTens      ({"IC"} | {"XC"} | {"LXXX"} | {"LXX"} | {"LX"} | {"L"} | {"IL"} | {"XL"} | {"XXX"} | {"XX"} | {"X"}) ;
token TokUnits     ({"IX"} | {"VIII"} | {"VII"} | {"VI"} | {"V"} | {"IV"} | {"III"} | {"II"} | {"I"}) ;
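
As a sanity check on what the grammar above is meant to accept, the same language can be sketched outside BNFC as one regular expression: an optional thousands, hundreds, tens and units part, in that order, with at least one of them present. This is only an illustrative Python sketch, not part of the BNFC grammar. The alternatives are copied as-is from the token definitions in the question (including the unusual IC and IL forms), and the names `alt`, `NUMERAL` and `is_numeral` are made up for the example.

```python
import re

# Alternatives copied from the token definitions in the question.
THOUSANDS = ["MMM", "MM", "M"]
HUNDREDS = ["CM", "DCCC", "DCC", "DC", "D", "CD", "CCC", "CC", "C"]
TENS = ["IC", "XC", "LXXX", "LXX", "LX", "L", "IL", "XL", "XXX", "XX", "X"]
UNITS = ["IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"]


def alt(words):
    """Build an optional non-capturing group matching any one of the words."""
    return "(?:" + "|".join(words) + ")?"


# Each part is optional, mirroring the empty productions N3, N7 and N11.
NUMERAL = re.compile(alt(THOUSANDS) + alt(HUNDREDS) + alt(TENS) + alt(UNITS))


def is_numeral(text):
    """True if text is a non-empty sequence thousands/hundreds/tens/units."""
    # The emptiness check plays the role of the "NE" (not empty) rules.
    return bool(text) and NUMERAL.fullmatch(text) is not None


print(is_numeral("MMI"))   # True
print(is_numeral("IIII"))  # False
print(is_numeral(""))      # False
```

Under this reading, MMI is a valid numeral, so a rejection by the generated parser has to come from somewhere other than the grammar rules themselves.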

Now, the problem is that when I build this parser and give it an input like

MMI

or, in general, any numeral with more than one non-empty part, the parser reports an error. BNFC cannot match MMI with a single one of my tokens, so it falls back on the built-in Ident rule. Since Ident doesn't appear in the grammar, this raises a parse error, even though the input string is perfectly valid under the grammar I defined. It's the bogus Ident rule that gets in the way.

Note: I verified that if I separate the different parts with spaces, the input is parsed correctly, but later on I want spaces to separate whole numerals, not the tokens inside a numeral.
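
The behaviour described above is consistent with a longest-match lexer, where the rule that consumes the most characters wins. The following is a toy Python model of that idea, not BNFC's generated code. It assumes that ties go to the rule listed first, and that Ident is a letter followed by letters, digits, underscores or primes. Only two of the question's tokens are included, and `RULES` and `longest_match_tokens` are names invented for the sketch.

```python
import re

# Toy rule table: user tokens first, then the assumed built-in Ident.
RULES = [
    ("TokThousands", re.compile(r"MMM|MM|M")),
    ("TokUnits", re.compile(r"IX|VIII|VII|VI|V|IV|III|II|I")),
    ("Ident", re.compile(r"[A-Za-z][A-Za-z0-9_']*")),
]


def longest_match_tokens(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


print(longest_match_tokens("MMI"))   # [('Ident', 'MMI')]
print(longest_match_tokens("MM I"))  # [('TokThousands', 'MM'), ('TokUnits', 'I')]
```

In this model, MMI is swallowed whole by Ident because Ident matches three characters while TokThousands matches only two, whereas the space in "MM I" stops Ident from reaching past MM.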


Solution

  • According to BNFC's documentation:

    These types are hard-coded and cannot be value types of rules

    This means there is no way to disable the built-in rules without modifying the generated code. The only option would be to write a script that automatically deletes the bogus rules from the generated files, and to always build the lexer and parser through a Makefile so that this step is never forgotten.

    It seems like the authors deliberately decided to reduce the flexibility of BNFC by imposing their own definition of what an integer literal is, what an identifier should look like, how tokens should be separated, and so on. They could have provided default rules with an option to disable them, but they decided that if you don't agree with their definitions then you shouldn't use their tool at all.
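
The post-processing step suggested above could be sketched roughly as follows. This is a hypothetical Python filter, not something BNFC provides. It drops every line of a generated file that matches one of the regular expressions you pass in, and you would have to work out the real patterns by inspecting your own generated lexer. The function names and the sample lines are made up for illustration.

```python
import re


def strip_rules(lines, patterns):
    """Return the lines that match none of the given regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [ln for ln in lines if not any(c.search(ln) for c in compiled)]


def strip_file(path, patterns):
    """Rewrite the file at path in place; return how many lines were dropped."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    kept = strip_rules(lines, patterns)
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return len(lines) - len(kept)


# Hypothetical sample; a real generated lexer will look different.
sample = ["user token rule\n", "HYPOTHETICAL built-in ident rule\n"]
print(strip_rules(sample, [r"built-in ident"]))  # ['user token rule\n']
```

The filter works line by line, so a rule that spans several lines would need a pattern for each of its lines. Calling `strip_file` from the Makefile target that runs BNFC would keep the step from being forgotten.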