Search code examples
parsingtokenizelexerebnf

How to use EBNF to drive the Parser?


I have a simple EBNF:

<program>   ::= <function>
<function>  ::= int <id> ( ) { <statement> }
<statement> ::= return <expr> ;
<expr>      ::= <int>

I am writing pure C and targeting Linux (for now). The answer doesn't have to be written in C.

I have a Lexer that simply indexes a source file and creates a linked list of token structs. The token only carries a char* and size_t as well as referencing the next token. That's all it has.

struct Token {
  char *ndx;
  size_t length;
  struct Token *next;
}

The Lexer indexes the source with only two basic conditions which it follows to discern - or differentiate - one token from another:

  • a space separates tokens
  • symbols are tokens - ( ) [ ] { } ; ... etc.

The source file I'm starting with looks like:

int main()
{
    return 2;
}

I am trying to figure out how I get from the EBNF to a ProgramExpr with the members - a FunctionExpr and a StatementExpr. Since I have the tokens representing the source file, the parser needs to resolve the tokens - i.e., discover what they are. Based on the production rules, the parser will build an Abstract Syntax Tree (what I also call ExpressionNode.

How do I get from the EBNF to an Expression Node?

I am thinking I will have a struct something like:

struct ExprNode {
  enum ExprType {
    PROG,
    FUNC,
    IDNT,
    STMT,
    EXPR,
    INT_LIT
  } type;
  char *term;
  int pos;
  int line;

  struct ExprNode* left;
  struct ExprNode* right;
};

What is the procedural order of parsing? Since my EBNF leads off with <program>, is my first parsing function looking for a program - i.e., a <function>? And if we don't have the function yet, does the parser drop into a looking for a function?

When I ask, everyone just says, "Use ANTLR." But I have a problem, and I want learn how to resolve it before I start using someone else's black box.

I've read Nora Sandler's article, Write a Compiler. While a good article, it seems some bits are skipped over and I'm just not getting it. I also realize that I may be confusing Parser Generator and Parser Combinator.

P.S. - someone suggested the Dragon Book ... that is on my list of necessary reading.


Solution

  • Since my EBNF leads off with <program>, is my first parsing function looking for a program - i.e., a <function>?

    Yes, so you might have a function look_for_a_program() that calls look_for_a_function().

    And if we don't have the function yet, does the parser drop into a looking for a function?

    look_for_a_function() will be more interesting. First it will expect the keyword int (i.e., a Token that points at the text "int" in the input). If that succeeds, it expects an <id> (a Token that points at word-like text in the input). And so on, mirroring the structure of the <function> rule in your EBNF. (When it gets to the <statement> part, it doesn't look for a Token, it calls look_for_a_statement.) If you can successfully get to recognizing the right brace at the end of the rule, then you can make an ExprNode to represent the <function> that you just recognized. (Dealing with errors is another matter.)

    This is a particular kind of parser called a "recursive descent parser", which you can find tons of help for on the web.

    (Very roughly speaking, ANTLR is a parser generator that will take your EBNF and create something like the above code for you.)

    Some thoughts:

    • You might want to add a field to the Token structure to convey the kind of token that the lexer has recognized. (E.g., an integer literal vs a word-like thing vs punctuation) This isn't strictly necessary, but it will make things easier for the parser.
    • The left and right fields of the ExprNode structure suggest that your AST will be a binary tree. This might work initially, but eventually (as you enlarge your EBNF) you'll probably want to allow an arbitrary number of "child nodes".