I'm writing a simple functional programming language in Scala using the parser-combinators library.
The syntax is specified here: https://github.com/hejfelix/Frase/blob/master/src/main/scala/it/vigtig/lambda/ParserLike.scala
There is one thing which I haven't been able to solve with the implementation: how do I separate grammar definition from the transformation into AST nodes?
Would be really cool to have a close-to-human-readable grammar directly the parser source, especially considering that I'm the only programmer on the project ATM and it could serve as documentation.
How do I separate the grammar and AST-specific code?
This is a great question, and one that I struggled with for a long time before coming up with a solution that I feel works pretty well for me.
When building a parser, I use two different syntax trees:
an Concrete Syntax Tree, or CST: this is a tree form of the text, and has a 1:1 correspondence with the text. Everything that appears in the text will also appear in the CST.
an Abstract Syntax Tree, or AST: this doesn't necessarily have a 1:1 correspondence with the text, as unnecessary textual details (such as braces, punctuation, etc.) have been removed and don't appear in the AST.
Thus, getting from the input text to the AST has two steps: the first step is to parse the input string into a CST; the second step is to convert the CST to an AST, throwing away unnecessary details.
String -> CST
This is where I use parser combinators. I don't do any manipulation of the tree structure in this stage. The structure of the CST is entirely determined by the combinators used. Each combinator produces a sub-tree of a certain shape, which I never change in this stage. There aren't any actions attached to combinators, so the grammar definition is clean and free of any AST information.
CST -> AST
This is where I massage the parse tree, extracting the important stuff, ignoring the rest. This is also where I often perform context-sensitive checks (for example: checking that a function definition doesn't have duplicate parameter names), keeping these details out of the actual parsing stage.
Example: here's a JSON parser I built using this method: