parsing f#abstract-syntax-tree recursive-descent

Parsing nested parentheses in F# using recursive descent

I'm trying to make a recursive descent parser in F#. I looked at http://www.fssnip.net/bM but this type of parser uses a string instead of a list.

I'm struggling with parsing parentheses, and especially nested parentheses. There are a lot of edge cases where the parser fails.

I used the following data type to represent the tokens.

type Token =
    | ParOpen
    | ParClose
    | Identifier of string
    | IntLiteral of int
    | Add

let example = [ParOpen;IntLiteral 3;Token.Add;IntLiteral 5;ParClose]

The data type below is used to represent nodes in the AST. It is a little bit like production rules.

type Expression =
  | Add of Expression * Expression
  | Par of Expression
  | Constant of int
  | Leaf //not used currently

The function below can be used for parsing "3 + 5" for instance. But it does not work for most cases when parsing parentheses. For example "(3 + 5) + 1" will fail.

let rec parse listOfTokens = 
    match listOfTokens with
    | IntLiteral i::Token.Add::list -> Add(Constant i, parse list)
    | IntLiteral i::[] -> Constant i
    | ParOpen::list -> Par(parse list)
    | IntLiteral i::ParClose::[] -> Constant i
    | _ -> failwith "Unexpected token"

let example = [ParOpen;IntLiteral 3;Token.Add;IntLiteral 5;ParClose;Token.Add;IntLiteral 1]

The parse function produces an AST which can be evaluated using the following function. But it does not always calculate correctly.

let rec evaluateExpression exp =
    match exp with
    | Add(x, y) -> evaluateExpression(x) + evaluateExpression(y)
    | Par(expression) -> evaluateExpression(expression)
    | Constant i -> i 
    | _ -> failwith "Unexpected expression"

A big problem is that pattern matching with lists allows me to only look at a few tokens at a time. For instance:

match tokens with
| ParOpen::exp::ParClose -> Par(parse exp) //exp is just one token right now, this does not work

I could implement a function which would filter/remove tokens after a ParClose is detected. Or is there an easier/better way to solve this problem?

Any advice would be appreciated. Or maybe you have a useful link?

Solution

You're trying to pack too much into this one feeble function. You're trying to do everything all at once, as a single, simple recursive function, but the problem is inherently more complex, which is what you're discovering in the form of "too many edge cases". As a general rule, if you have too many edge cases, it means you need to redefine your edges.

The key insight here is that a parsing function shouldn't have a simple signature Token list -> Expression. Instead, the parsing function should:

Be able to return not just an Expression, but any kind of value - i.e. be generic. This way you can parse intermediate values, which aren't themselves Expression yet, and then combine them into an Expression.
Be not a single function, but a whole family of functions - each one parsing its own little bit and not trying to encompass all the glory in one shot.
Be able to return a failure - i.e. a case when whatever the function is parsing couldn't be parsed. This will allow you to try several parsers in succession to see if any of them match.
Return not just a result/failure, but also, crucially, return the remainder of the input stream - i.e. the list of tokens that come after the parsing function stopped parsing. This way you can parse several things in succession, one after another - for example, "open paren", then "some expression", then "close paren".
Provide a way to combine multiple parsers together, as mentioned above - either in succession or as alternatives to try one after another. We'll get back to this one a bit later.

Let's try to write down the type of this function, just so we have it in front of our eyes and can go back to it for reference:

type ParseError = string  // Error is going to be just a string for now
type ParseFn<'a> = Token list -> Result< ('a * Token list), ParseError >

So here you can see that ParserFn is:

Generic
Takes the input list of tokens
Returns either parsing result 'a plus the remainder of tokens, or an error

With that signature in mind, let's try to implement the simplest parsing function that can be - parsing an "open paren" token:

let parseParOpen tokens = 
    match tokens with
    | ParOpen::rest -> Ok (ParOpen, rest)
    | first::_ -> Error $"expected ParOpen, got {first}"
    | [] -> Error $"expected ParOpen, got EOF"

Let's test it out:

> parseParOpen [ParOpen; Add]
Ok (ParOpen, [Add])

> parseParOpen [Add]
Error "expected ParOpen, got Add"

> parseParOpen []
Error "expected ParOpen, got EOF"

Nice! Now lets implement a parser for ParClose:

let parseParClose tokens = 
    match tokens with
    | ParClose::rest -> Ok (ParClose, rest)
    | first::_ -> Error $"expected ParClose, got {first}"
    | [] -> Error $"expected ParClose, got EOF"

Hmmm, that seems awfully repetitive, doesn't it? Can we extract the common parts? Of course we can!

let parseToken t tokens = 
    match tokens with
    | first::rest when first = t -> Ok (t, rest)
    | first::_ -> Error $"expected {t}, got {first}"
    | [] -> Error $"expected {t}, got EOF"

The parseToken function takes the token it's supposed to parse as a parameter, and returns a ParserFn<Token> as a result. Let's try it out:

> parseToken ParOpen [ParOpen; Add; ParClose]
Ok (ParOpen, [Add; ParClose])

> parseToken ParClose [ParOpen; Add; ParClose]
Error "expected ParClose, got ParOpen"

> parseToken ParClose [ParClose; Add]
Ok (ParClose, [Add])

Even nicer!

How about parsing a literal? That's a bit more tricky: we can't just call parseToken (IntLiteral 3), because that would only parse the literal 3, but error out on literal 5. So what to do? Well, this is a legitimate case of requiring a special function for parsing literals. And in general, when designing parsers, you will more or less find a separate function for every rule in your grammar. Sometimes they can be parametrized, like parseParOpen and parseParClose, but in general they can't.

So let's parse a literal:

let parseIntLiteral tokens =
    match tokens with
    | (IntLiteral i)::rest -> Ok (i, rest)
    | first::_ -> Error $"expected an int literal, got {first}"
    | [] -> Error $"expected an int literal, got EOF"

(note that even this can be combined, to an extent, with parseToken, to avoid repeating the two error lines; but I won't go into that, since this answer is already too long)

Let's test it out:

> parseIntLiteral [IntLiteral 5; Add; ParClose]
Ok (5, [Add; ParClose])

> parseIntLiteral [ParOpen; Add; ParClose]
Error "expected an int literal, got ParOpen"

Now, how would we go about parsing several things in succession - like, for example, "3 + 5"? Well, that should be pretty straightforward: first parse an int, then a plus sign, then an int again. Let's try that out:

let parseAddition tokens = 
    match parseIntLiteral tokens with
    | Error e -> Error e
    | Ok (x, tokens1) ->
        match parseToken Add tokens1 with
        | Error e -> Error e
        | Ok (_, tokens2) ->
            match parseIntLiteral tokens2 with
            | Error e -> Error e
            | Ok (y, rest) -> Ok (Expression.Add (Constant x, Constant y), rest)

Wow, that's what one calls a "Pyramid of Doom"! This one has just three levels, but imagine if we got to more complex stuff!

So what to do? Can we, perhaps, extract some common parts? Oh yes we can! Notice how it's the same pattern at all three levels: first we apply some parser to the incoming tokens, then if it's an error, we bail right away, otherwise we apply the next parser, rinse and repeat. We can surely capture that pattern as a function:

let applyNextParser firstParser nextParser tokens =
    match firstParser tokens with
    | Error e -> Error e
    | Ok (r, rest) -> nextParser r rest

Note how firstParser is just a parser, but nextParser is a function that takes the first parser's result and returns another parser, which we then apply to the rest tokens.

let parseAddition tokens =
    let combinedParser =
        applyNextParser parseIntLiteral (fun x ->
            applyNextParser (parseToken Add) (fun _ ->
                applyNextParser parseIntLiteral (fun y ->
                    fun restTokens -> Ok (Expression.Add (Constant x, Constant y), restTokens)
                )
            )
        )
      
    combinedParser tokens

One thing to note about this is that applyNextParser returns a parsing function, which we store in the combinedParser variable and then call it with our parameter tokens. Of course we can just get rid of the intermediate variable:

let parseAddition =
    applyNextParser parseIntLiteral (fun x ->
        applyNextParser (parseToken Add) (fun _ ->
            applyNextParser parseIntLiteral (fun y ->
                fun restTokens -> Ok (Expression.Add (Constant x, Constant y), restTokens)
            )
        )
    )

Another thing to note is the expression in the very middle - it's a function that takes restTokens and returns Ok without consuming any of those restTokens. Such function can also be looked at as a kind of parser. It's a parser that doesn't consume any input, but already has a ready parsing result for you. This will be very useful as we proceed, so let's extract this pattern as well:

let constParser c tokens = Ok (c, tokens)

And then:

let parseAddition =
    applyNextParser parseIntLiteral (fun x ->
        applyNextParser (parseToken Add) (fun _ ->
            applyNextParser parseIntLiteral (fun y ->
                constParser (Expression.Add (Constant x, Constant y))
            )
        )
    )

Ok, this is much nicer than the original Pyramid of Doom, but still pretty pyramidal. Can we do any better? Oh yes we can, if only we make applyNextParser into an operator:

let (>>=) firstParser nextParser tokens =
    match firstParser tokens with
    | Error e -> Error e
    | Ok (r, rest) -> nextParser r rest

And then:

let parseAddition =
    parseIntLiteral >>= (fun x ->
        parseToken Add >>= (fun _ ->
            parseIntLiteral >>= (fun y ->
                constParser (Expression.Add (Constant x, Constant y))
            )
        )
    )

What's that? Not much better you say? But wait! Now that it's an operator, the F# syntax allows us to remove the parentheses:

let parseAddition =
    parseIntLiteral >>= fun x ->
        parseToken Add >>= fun _ ->
            parseIntLiteral >>= fun y ->
                constParser (Expression.Add (Constant x, Constant y))

What, still pyramidal? But wait again! The F# syntax also allows us to get rid of the indentation:

let parseAddition =
    parseIntLiteral >>= fun x ->
    parseToken Add >>= fun _ ->
    parseIntLiteral >>= fun y ->
    constParser (Expression.Add (Constant x, Constant y))

And behold! Now it almost looks like we're "assigning" the result of each parser to a variable. The first parseIntLiteral is "assigned" to x, the second parseIntLiteral is "assigned" to y, and then the ultimate result combines x and y to get a new parsing result. We finally have a sane way to combine multiple parsers in a sequence!

Let's test it out:

> parseAddition [IntLiteral 3; Add; IntLiteral 5]
Ok (Add (Constant 3, Constant 5), [])

> parseAddition [IntLiteral 3; ParOpen; IntLiteral 5]
Error "expected Add, got ParOpen"

> parseAddition [Add; IntLiteral 5]
Error "expected an int literal, got Add"

Phew! Now, finally, we can parse the coveted parenthesized expression:

let parseParens =
    parseToken ParOpen >>= fun _ ->
    parseAddition >>= fun exp ->
    parseToken ParClose >>= fun _ ->
    constParser exp

> parseParens [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
Ok (Add (Constant 3, Constant 5), [])

> parseParens [ParOpen; IntLiteral 3; Add; IntLiteral 5]
Error "expected ParClose, got EOF"

Ok, that works, but wait! Does this mean we can only parse parentheses? What about an expression without any parentheses? Well, here we have finally come to the question of trying out several parsers in sequence and seeing which one works. For you see, your expression grammar actually has a hidden structure: it could either be parenthesized or not, and parenthesized should take precedence. That is, we first try to parse a parenthesized expression, and only if that doesn't work, we should fall back to the "regular" one.

So, just like before, let's try to just do this:

let parseExpression tokens =
    match parseParens tokens with
    | Ok (exp, rest) -> Ok (exp, rest)
    | Error _ ->
        match parseAddition tokens with
        | Ok (exp, rest) -> Ok (exp, rest)
        | Error e -> Error e

Here, we first try parseParens, and only if that fails, we go back and try parseAddition. This works, but once again, we got a Pyramid of Doom. Sure it's tiny and cute right now, but imagine adding a few more alternatives. Except this time, the pyramid grows out of the Error cases, whereas last time it was the Ok cases. But no matter: just like last time, we can extract the pyramid aspect into a handy operator:

let (<|>) p1 p2 tokens =
    match p1 tokens with
    | Ok (result, rest) -> Ok (result, rest)
    | Error _ ->
        match p2 tokens with
        | Ok (result, rest) -> Ok (result, rest)
        | Error e -> Error e

let parseExpression =
    parseParens <|> parseAddition

Boom! Now we can clearly see that an "expression" is either "parens" or "addition". Readable, eh?

(note that our definition of <|> swallows the first error and only preserves the second one; this answer is already way too long, so I'll leave the logic of combining the errors as an exercise)

Ok, let's test it out:

> parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
Ok (Add (Constant 3, Constant 5), [])

> parseExpression [IntLiteral 3; Add; IntLiteral 5]
Ok (Add (Constant 3, Constant 5), [])

Nice! Let's test it out some more:

> parseExpression [IntLiteral 3]
Error "expected Add, got EOF"

Oh. Oh my. What happened here?

Well, the program behaves exactly as we told it to, actually: an expression is either a parenthesized addition or a naked addition. A single int literal is not allowed. That's what we wrote, so that's what the program does. In fact, we need to do some more work to define what our expression actually is:

EXPRESSION is either an ADDITION or a TERM
TERM is either an IntLiteral or PARENS
ADDITION is a TERM, followed by Add, followed by an EXPRESSION
PARENS is ParOpen, followed by EXPRESSION, followed by ParClose

Note 1: these definitions are mutually recursive. They have to be, because you want sub-expressions inside parens, and parens as part of expressions.
Note 2: I had to invent a new concept - TERM. This is required to avoid making the grammar infinitely recursive: because EXPRESSION can be an ADDITION, we can't have ADDITION itself start with an EXPRESSION. This is the kind of subtlety you only get with experience (or formal education).
Note 3: I left out Identifier: you're not talking about it at all, so I'll leave it as an exercise.

So now that we have these rules, let's try to rewrite our parsers. They will have to be mutually recursive of course, and we can follow the above English definitions very closely, more or less translating them into F# word for word:

let rec parseExpression =
    parseAddition <|> parseTerm

and parseTerm =
    parseParens
    <|> (parseIntLiteral >>= fun i -> constParser (Constant i))

and parseAddition =
    parseTerm >>= fun x ->
    parseToken Add >>= fun _ ->
    parseExpression >>= fun y ->
    constParser (Expression.Add (x, y))

and parseParens =
    parseToken ParOpen >>= fun _ ->
    parseExpression >>= fun exp ->
    parseToken ParClose >>= fun _ ->
    constParser exp

Let's test it out:

> parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose]
Ok (Add (Constant 3, Constant 5), [])

> parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose; Add; IntLiteral 8]
Ok (Add (Add (Constant 3, Constant 5), Constant 8), [])

> parseExpression [IntLiteral 3; Add; IntLiteral 5; Add; IntLiteral 8]
Ok (Add (Constant 3, Add (Constant 5, Constant 8)), [])

> parseExpression [IntLiteral 3]
Ok (Constant 3, [])

> parseExpression [ParOpen; IntLiteral 3; ParClose]
Ok (Constant 3, [])

> parseExpression [ParOpen; IntLiteral 3; Add; IntLiteral 5; ParClose; Add; ParOpen; IntLiteral 8; Add; ParOpen; IntLiteral 1; Add; IntLiteral 10; ParClose; ParClose];;
Ok
  (Add
     (Add (Constant 3, Constant 5),
      Add (Constant 8, Add (Constant 1, Constant 10))), [])

And in conclusion, a few general notes:

The above explains the logic of how it all works and fits together, but actual modern parser libraries are a bit more complicated, because they are performance-optimized and support streaming inputs.
One of such actual parser libraries for F# is FParsec. It implements all of the above and much, much more. I highly recommend checking it out.
As mentioned already, you have to be careful with recursion. The best way to make sure you understand what your grammar is and how it's supposed to work is to write it down formally before starting to write code - kind of like I did above with EXPRESSION, TERM, ADDITION, and PARENS. There are well-developed formal systems for writing down grammars. One de-facto standard is the Backus-Naur form, check it out.
If you add more operators, like subtraction or multiplication, this will get an order of magnitude more complicated due to the subtle handling of operator precedence. If/when you get to this, remember: the secret is inventing more intermediate concepts, like I did with TERM above.