
Is it possible to describe block comments using EBNF?


Say, I have the following EBNF:

document    = content , { content } ;
content     = hello world | answer | space ;
hello world = "hello" , space , "world" ;
answer      = "42" ;
space       = " " ;

This lets me parse something like:

hello world 42

Now I want to extend this grammar with a block comment. How can I do this properly?

If I start simple:

document    = content , { content } ;
content     = hello world | answer | space | comment ;
hello world = "hello" , space , "world" ;
answer      = "42" ;
space       = " " ;
comment     = "/*" , { ?any character? } , "*/" ;

I cannot parse:

hello /* I'm the taxman! */ world 42

If I extend the grammar further with the special case from above, it gets ugly, but parses.

document    = content , { content } ;
content     = hello world | answer | space | comment ;
hello world = "hello" , { comment } , space , { comment } , "world" ;
answer      = "42" ;
space       = " " ;
comment     = "/*" , { ?any character? } , "*/" ;

But I still cannot parse something like:

hel/*p! I need somebody. Help! Not just anybody... */lo world 42

How would I do this with an EBNF grammar? Or is it not even possible at all?


Solution

  • If you treat "hello" as a token, nothing should be allowed to break it up. If you really do need comments inside the word, you have to expand the rule character by character:

    hello_world = "h", {comment}, "e", {comment}, "l", {comment}, "l", {comment}, "o" ,
                  { comment }, space, { comment },
                  "w", {comment}, "o", {comment}, "r", {comment}, "l", {comment}, "d" ;
    

    On the broader question: language specifications commonly leave comments out of the formal grammar and describe them in a side note instead. If you do want them in the grammar, the usual approach is to treat a comment as equivalent to whitespace:

    space = " " | comment ;
    

    You may also want a rule for one or more consecutive whitespace items:

    spaces = { space }- ;
    

    Cleaning up your final grammar, but treating "hello" and "world" as tokens (i.e. not allowing them to be broken apart), could result in something like this:

    document    = { content }- ;
    content     = hello world | answer | space ;
    hello world = "hello" , spaces , "world" ;
    answer      = "42" ;
    spaces      = { space }- ;
    space       = " " | comment ;
    comment     = "/*" , { ?any character? } , "*/" ;
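
To see how the cleaned-up grammar behaves on the inputs from the question, here is a minimal recognizer sketch in Python with one function per rule. It is only an illustration, not a definitive implementation: the function names (`match_comment`, `parses`, and so on) are made up for this example, the comment body is read as zero or more characters, and a comment is assumed to end at the first "*/" (no nesting). Alternatives are tried in the order they are written, which is enough for this small grammar.

```python
from typing import Optional


def match_comment(text: str, pos: int) -> Optional[int]:
    """comment = "/*" , { any character } , "*/" ;  (assumed to end at the first "*/")"""
    if not text.startswith("/*", pos):
        return None
    end = text.find("*/", pos + 2)
    return None if end == -1 else end + 2


def match_space(text: str, pos: int) -> Optional[int]:
    """space = " " | comment ;"""
    if text.startswith(" ", pos):
        return pos + 1
    return match_comment(text, pos)


def match_spaces(text: str, pos: int) -> Optional[int]:
    """spaces = { space }- ;  (one or more)"""
    end = match_space(text, pos)
    if end is None:
        return None
    while (nxt := match_space(text, end)) is not None:
        end = nxt
    return end


def match_hello_world(text: str, pos: int) -> Optional[int]:
    """hello world = "hello" , spaces , "world" ;"""
    if not text.startswith("hello", pos):
        return None
    end = match_spaces(text, pos + len("hello"))
    if end is None or not text.startswith("world", end):
        return None
    return end + len("world")


def match_answer(text: str, pos: int) -> Optional[int]:
    """answer = "42" ;"""
    return pos + 2 if text.startswith("42", pos) else None


def match_content(text: str, pos: int) -> Optional[int]:
    """content = hello world | answer | space ;  (tried in this order)"""
    for rule in (match_hello_world, match_answer, match_space):
        end = rule(text, pos)
        if end is not None:
            return end
    return None


def parses(text: str) -> bool:
    """document = { content }- ;  (one or more, consuming the whole input)"""
    pos = match_content(text, 0)
    if pos is None:
        return False
    while pos < len(text):
        pos = match_content(text, pos)
        if pos is None:
            return False
    return True


if __name__ == "__main__":
    for sample in (
        "hello world 42",
        "hello /* I'm the taxman! */ world 42",
        "hel/*p! I need somebody. */lo world 42",
    ):
        print(repr(sample), parses(sample))
```

Because "hello" and "world" are matched as whole tokens, a comment between the two words is accepted, while a comment inside a word is rejected, which is the trade-off described above.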