I'm writing a C compiler in JavaScript, purely for enrichment (as yet I expect no practical use for it, and I will likely not maintain it).
I wrote a lexer that, given a list of regular expressions and the token type each expression matches, can successfully tokenize any string.
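In sketch form it works something like this (the pattern list here is just a sample, not my full set):

// Minimal regex-driven lexer sketch. The pattern list below is
// illustrative; a real C lexer needs many more entries.
function tokenize(src, patterns) {
    var tokens = [], pos = 0;
    while (pos < src.length) {
        var matched = false;
        for (var i = 0; i < patterns.length; i++) {
            // Anchor each regex at the current position.
            var re = new RegExp("^(?:" + patterns[i].re + ")");
            var m = re.exec(src.slice(pos));
            if (m) {
                if (patterns[i].type !== "ws")   // skip whitespace
                    tokens.push({ type: patterns[i].type, text: m[0] });
                pos += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) throw new Error("no token matches at position " + pos);
    }
    return tokens;
}

var patterns = [
    { re: "\\s+",           type: "ws" },
    { re: "\\+\\+|--|[()]", type: "punct" },
    { re: "[A-Za-z_]\\w*",  type: "id" }
];
// tokenize("++foo", patterns)
//   => [{type:"punct",text:"++"}, {type:"id",text:"foo"}]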
I have been able to successfully tokenize C source code (somewhat reduced C, to be fair; I need to add more tokenizer patterns to capture everything). I now want to construct ASTs as an intermediate form between the source language and the translated assembly.
To do this, I am trying to implement a parsing function driven by a context-free grammar, defined as an object whose keys are nonterminals and whose values are arrays of productions (each production itself an array of symbols).
Here is a sample CFG one might feed the parser (an adapted excerpt from a C grammar):
var cfg = {
    "cast_exp": [
        ["unary_exp"],
        ["(", "type_name", ")", "cast_exp"]
    ],
    "unary_exp": [
        ["primary_exp"],
        ["++", "unary_exp"],
        ["--", "unary_exp"]
    ],
    "primary_exp": [
        ["id"]
    ]
};
id is one of the types my tokenizer picks up, so I suppose we can consider "primary_exp" a start symbol.
Now, my thought is to do this recursively; that is, pick up the first token and match it against one of the starting symbols. Recurse on the remaining tokens, sending across the target we matched in the previous call, and see what production rule is composed of the target we just matched.
This isn't making too much sense to me, and the way I see it, I will get lost in infinite recursion (or encounter a stack overflow on really long source files).
How do I write a function that can walk my array of tokens and, using the CFG described above, construct ASTs? Since I'm doing this for enrichment and as a personal challenge, if you'd like you can provide code, but I'm looking more for guidance and a broad-sense description of such an algorithm.
You can implement an Earley parser. (The Wikipedia article has code, so I'm not supplying any.)
Such a parser transitions from state to state as it consumes tokens. In each state, it holds a set of "items":
{ I1 I2 ... In }
Each individual item Ik is a rule plus a marker for how much of that rule has been processed so far (a position called the "dot").
For a rule
R = A B C D;
where A and B have been seen, an item for R is conceptually the same rule with a dot mark:
R = A B <dot> C D ;
with the dot indicating that A and B have been seen, and that C needs to be found next. The state's item set might look like:
{  P = Q <dot> R S ;
   R = A B <dot> C D ;
   C = <dot> X Y ;
}
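In code, an item needs little more than the rule, the dot position, and (for an Earley parser) the index of the state in which the item was created. A sketch in JavaScript, using the CFG object format from the question (names here are illustrative):

// An Earley item: which production of which nonterminal, how far the
// dot has advanced, and the index of the state where the item was
// created (needed at reduction time to find the items to advance).
function makeItem(lhs, production, dot, origin) {
    return { lhs: lhs, production: production, dot: dot, origin: origin };
}

// The symbol just after the dot, or undefined if the dot is at the end.
function nextSymbol(item) {
    return item.production[item.dot];
}

// For R = A B <dot> C D, created in state 0:
// var item = makeItem("R", ["A", "B", "C", "D"], 2, 0);
// nextSymbol(item) === "C"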
Each item Ik represents a possible interpretation of the input seen so far; there are multiple items because the input may have multiple interpretations valid up to the current point in the input stream. Processing a token changes this set of items, producing the next state.
For our example rule R, when a C is found by the parser (either as a token in the input stream, or because some other rule reduces and produces its left-hand side as a nonterminal), the dot is moved:
R = A B C <dot> D;
creating a new item for the item set in the next parser state.
All rules in the item set are processed for each token; if the token allows the parser to "shift" over the next rule element, the item with a revised dot is placed in the item set for the next state; otherwise the rule is no longer a valid interpretation of the input and is discarded (i.e., isn't placed in the next set). As the dot is moved, it indicates that new input is possible (for rule R above, D is now possible), and rules that allow D to be processed are added to the new state with the dot at the beginning of the rule. This may add several new rules to the set.
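In terms of the item sketch above, the "shift" step (Earley's scanner) and the "add rules with the dot at the beginning" step (the predictor) might look like this; I'm assuming here that the tokenizer labels punctuation tokens with their literal text as the type, so terminals can be compared directly:

// Scanner: every item that expects the incoming token's type gets its
// dot advanced and is placed in the item set of the next state; items
// that cannot shift are simply not carried forward (discarded).
function scan(state, token) {
    var next = [];
    state.forEach(function (item) {
        if (item.production[item.dot] === token.type)
            next.push(makeItem(item.lhs, item.production,
                               item.dot + 1, item.origin));
    });
    return next;
}

// Predictor: if the symbol after an item's dot is a nonterminal, add
// all of that nonterminal's productions to the current state with the
// dot at the beginning (deduplicated, so prediction terminates).
function predict(state, item, stateIndex, cfg) {
    var sym = item.production[item.dot];
    (cfg[sym] || []).forEach(function (prod) {
        if (!hasItem(state, sym, prod, 0, stateIndex))
            state.push(makeItem(sym, prod, 0, stateIndex));
    });
}

function hasItem(state, lhs, prod, dot, origin) {
    return state.some(function (it) {
        return it.lhs === lhs && it.production === prod &&
               it.dot === dot && it.origin === origin;
    });
}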
When a dot ends up at the far end:
R = A B C D <dot> ;
then in effect R has been seen as a nonterminal (this is called a "reduction" to R) and can be used to advance the dot in other rules in the current state that mention R:
P = Q <dot> R S ;
transitions to P = Q R <dot> S ;
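In the same sketch style, this reduction step (Earley's completer) goes back to the state where the finished item began and advances every item there that was waiting on the reduced nonterminal (chart is the assumed array of all item sets built so far; hasItem is the helper from the previous sketch):

// Completer: the item's dot is at the end, so item.lhs has been
// recognized ("reduced"). Advance every item, in the state where this
// item began, that has its dot just before that nonterminal.
function complete(state, item, chart) {
    chart[item.origin].forEach(function (waiting) {
        if (waiting.production[waiting.dot] === item.lhs &&
            !hasItem(state, waiting.lhs, waiting.production,
                     waiting.dot + 1, waiting.origin))
            state.push(makeItem(waiting.lhs, waiting.production,
                                waiting.dot + 1, waiting.origin));
    });
}

// e.g., completing R = A B C D <dot> (origin 0) against the state-0
// item P = Q <dot> R S yields P = Q R <dot> S in the current state.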
Now, this process is applied to all items (rule+dot) in the current state as tokens are processed.
The parser starts in a first state with a one-element set consisting of the goal rule (what you called the "start symbol"), with the dot indicating that no part of the rule has been consumed; in your case:
{ primary_exp = <dot> id ; }
A little thought will let you conclude that the goal rule always stays in the item set with its dot somewhere. Parsing is complete when the dot falls off the end of the goal rule (i.e., the input reduces to the goal nonterminal) and the input stream is entirely consumed.
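Putting the sketches together, the driver is short. This version reuses makeItem/scan/predict/complete/hasItem from above, seeds the first state with every production of the goal symbol, and ignores epsilon (empty) productions, which need extra care:

// Recognizer driver: chart[k] is the item set after consuming k tokens.
function earleyRecognize(tokens, cfg, goal) {
    var chart = [cfg[goal].map(function (prod) {
        return makeItem(goal, prod, 0, 0);
    })];
    for (var k = 0; k <= tokens.length; k++) {
        var state = chart[k];
        // Items pushed during the loop extend `state`, so the same loop
        // visits them; this handles chained predictions and reductions.
        for (var i = 0; i < state.length; i++) {
            if (state[i].dot === state[i].production.length)
                complete(state, state[i], chart);   // reduction
            else
                predict(state, state[i], k, cfg);   // no-op on terminals
        }
        if (k < tokens.length)
            chart.push(scan(state, tokens[k]));     // shift into state k+1
    }
    // Accept when a goal item's dot has fallen off the end of its rule
    // with the input entirely consumed.
    return chart[tokens.length].some(function (it) {
        return it.lhs === goal && it.origin === 0 &&
               it.dot === it.production.length;
    });
}

// earleyRecognize([{type:"++"},{type:"id"}], cfg, "unary_exp") === true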
Earley parsers are relatively fast (worst-case cubic in the input length, but close to linear on most practical grammars) and very general; they will parse any context-free language. That's surprisingly powerful. (If you understand how an Earley parser works with items, you understand most of what you need to know to understand how LR parsers work.) And they are pretty easy to build.
The Wikipedia article has an example worked in more detail.
As to building trees with an Earley parser (or any similar type of parser), whenever a reduction to R takes place, you can build a tree node whose root is type R and whose children are the trees previously built for its elements.
Obviously, when processing a terminal token t, one builds a unit tree for t. [It is easy to understand why when you realize your lexer is actually a sub-parser, which is "reducing" strings of characters to the terminal token. You could have put the lexer rules into the grammar operating on character terminal tokens; you just chose not to, for efficiency reasons. You could do this for fun; the Earley parser will do just fine but will run pretty slowly because now it is doing all that item set management on a larger set of rules, on a per-character basis.].
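Concretely, one way to do this in the sketches above is to give each item a kids array holding the trees for the symbols already behind its dot (items would then be created with kids: []); the scanner attaches unit trees and the completer attaches the newly built node. Naive copying like this is fine for small grammars, though ambiguous grammars really want a shared forest:

// Tree-building variants of scan/complete: each item carries `kids`,
// the trees for the symbols already behind its dot.
function scanTree(state, token) {
    var next = [];
    state.forEach(function (item) {
        if (item.production[item.dot] === token.type)
            next.push({ lhs: item.lhs, production: item.production,
                        dot: item.dot + 1, origin: item.origin,
                        kids: item.kids.concat([token]) }); // unit tree for t
    });
    return next;
}

function completeTree(state, item, chart) {
    // Reduction to item.lhs builds a node whose children are the trees
    // collected while recognizing its right-hand side...
    var node = { type: item.lhs, kids: item.kids };
    chart[item.origin].forEach(function (waiting) {
        if (waiting.production[waiting.dot] === item.lhs)
            // ...and that node becomes a child of each waiting item.
            state.push({ lhs: waiting.lhs, production: waiting.production,
                         dot: waiting.dot + 1, origin: waiting.origin,
                         kids: waiting.kids.concat([node]) });
    });
}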
Keeping track of all this stuff while parsing seems a bit tricky, but is not actually that hard. I leave it to the reader.
For a contrast, see how to do all this parsing and tree building using hand-coded recursive descent parsing. (Recursive descent parsers aren't as powerful; in particular, they can have a hard time with left-recursive grammar rules. But they are really easy to write if you have a grammar.)
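For the sample grammar above, such a hand-coded parser might look like this (node shapes and the cursor convention are illustrative); since none of these rules is left recursive, that pitfall doesn't bite here:

// Hand-coded recursive descent for unary_exp/primary_exp. `cur` is a
// mutable cursor object so position updates survive returns.
function parseUnaryExp(toks, cur) {
    var t = toks[cur.i];
    if (t && (t.type === "++" || t.type === "--")) {
        cur.i++;                       // consume the prefix operator
        return { type: "unary_exp", kids: [t, parseUnaryExp(toks, cur)] };
    }
    return { type: "unary_exp", kids: [parsePrimaryExp(toks, cur)] };
}

function parsePrimaryExp(toks, cur) {
    var t = toks[cur.i];
    if (t && t.type === "id") {
        cur.i++;
        return { type: "primary_exp", kids: [t] };
    }
    throw new Error("expected id at token " + cur.i);
}

// parseUnaryExp([{type:"++"},{type:"id",text:"x"}], { i: 0 })
// => unary_exp node wrapping "++" and a nested unary_exp/primary_exp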