parsing grammar language-design higher-order-functions

Unambigious grammar for higher order functions

I have a grammar that looks like this:

<type> ::= <base_type> <optional_array_size>
<optional_array_size> ::= "[" <INTEGER_LITERAL> "]" | ""

<base_type> ::= <integer_type> | <real_type> | <function_type>

<function_type> ::= "(" <params> ")" "->" <type>

<params> ::= <type> <params_tail> | ""
<params_tail> ::= "," <type> <params_tail> | ""

so that I can define types like Integer[42], Real, or (Integer, Real) -> Integer. This is all good and well, but I would like my functions to be first class citizens. Given the grammar above, I can't have arrays of functions, as it would only turn the return type into an array. (Integer, Real) -> Integer [42] won't be an array of 42 functions, but one function that returns an array of 42 integers.

I was considering adding optional parenthesis around function types ((Integer, Real) -> Integer)[42], but that creates another issue (note: I am using a top-down recursive descent parser, so my grammar has to be LL(1)).:

<function_type> ::= "(" <function_type_tail>

<function_type_tail> ::= <params> ")" "->" <type>
                       | "(" <params> ")" "->" <type> ")"

The issue is that first(params) contains "(" because function types could be passed as function parameters: ((Integer) -> Real, Real) -> Integer. This syntax was valid before I modified the grammar, but it no longer works now. How can I modify my grammar to get what I want?

Solution

That's definitely a challenge.

It's much easier to make an LR grammar for that language, although it's still a bit of a challenge. To start with, it's necessary to remove the ambiguity which from

<type> ::= <base_type> <optional_array_size>
<base_type> ::= <function_type>
<function_type> ::= "(" <params> ")" "->" <type>

The ambiguity, as I'm sure you know, results from not knowing whether the [42] in ()->Integer[42] is part of the top-level <type> or the enclosed <function_type>. To remove the ambiguity, we need to be explicit about what construct can take an array size. (Here, I've added the desired production which allows <type> to be parenthesized):

<type>        ::= <atomic_type> <optional_array_size>
              |   <function_type>
<opt_array_size>  ::= ""
              |   <array_size>
<atomic_type> ::= <integer_type>
              |   <real_type>
              |   "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
<opt_params>  ::= ""
              |   <params>
<params>      ::= <type>
              |   <params> "," <type>

Unfortunately, that grammar is LR(2), not LR(1). The problem occurs with

( Integer   ) [ 42 ]
( Integer   ) -> Integer
          ^
          |
          +----------------- Lookahead

At the lookahead point, the parser still doesn't know if it is looking at a (redundantly) parenthesized type or at the parameter list in a function type. It won't know that until it sees the following symbol (which might be the end of input, in addition to the two options above). In both cases, it needs to reduce Integer to <atomic_type> and then to <type>. But then, in the first case it can just shift the close parenthesis, while in the second case it needs to continue reducing, first to <params> and then to <opt_params>. That's a shift-reduce conflict. Of course, it can easily be resolved by looking one more token into the future, but the need to see two tokens into the future is what makes the grammar LR(2).

Fortunately, LR(k) grammars can always be reduced to LR(1) grammars. (This is not true of LL(k) grammars, by the way.) It just gets a bit messy because it is necessary to introduce a bit of redundancy. We do that by avoiding the need to reduce <type> until we know that we have a parameter list, which means that we need to accept "(" <type> ")" without committing to one or the other parse. That leads to the following, where an apparently redundant rule was added to <function_type> and <opt_params> was modified to accept either 0 or at least two parameters:

<type>        ::= <atomic_type> <optional_array_size>
              |   <function_type>
<atomic_type> ::= <integer_type>
              |   <real_type>
              |   "(" <type> ")"
<function_type> ::= "(" <opt_params> ")" "->" <type>
              |   "(" <type> ")" "->" <type>
<opt_params>  ::= ""
              |   <params2>
<params2>     ::= <type> "," <type>
              |   <params2> "," <type>

Now, I personally would stop there. There are lots of LR parser generators out there, and the above grammar is LALR(1) and still reasonably easy to read. But it is possible to convert it to an LL(1) grammar, with quite a bit of work. (I used a grammar transformation tool to do some of these transformations.)

It's straight-forward to remove left-recursion and then left-factor the grammar:

# Not LL(1)
<type>          ::= <atomic_type> <opt_size>
                |   <function_type>
<opt_size>      ::= ""
                |   "[" integer "]"
<atomic_type>   ::= <integer_type>
                |   <real_type>
                |   "(" <type> ")"
<function_type> ::= "(" <fop>
<fop>           ::= <opt_params> ")" to <type>
                |   <type> ")" to <type>
<opt_params>    ::= ""
                |   <params2>
<params2>       ::= <type> "," <type> <params_tail>
<params_tail>   ::= "," <type> <params_tail>
                |   ""

But that's not sufficient, because <function_type> and <atomic_type> can both start with "(" <type>. And there's a similar problem between the productions for the parameter list. To get rid of these issues, we need yet another technique: expand non-terminals in place in order to get the conflicts into the same non-terminal so that we can left-factor them. As with this example, that often comes at the cost of some duplication.

By expanding <atomic_type>, <function_type> and <opt_params>, we get:

<type>     ::= <integer_type> <opt_size>
           |   <real_type> <opt_size>
           |   "(" <type> ")" <opt_size>
           |   "(" ")" "->" <type>
           |   "(" <type> ")" "->" <type>
           |   "(" <type> "," <type> <params2> ")" "->" <type>
<opt_size> ::= "" 
           |   "[" INTEGER_LITERAL "]"
<params2>  ::= ""
           |   "," <type> <params2>

And then we can left-factor to produce

<type>     ::= <integer_type> <opt_size>
           |   <real_type> <opt_size>
           |   "(" <fop>
<fop>      ::= <type> <ftype>
           |   ")" "->" <type>
<ftype>    ::= ") <fcp>
           |   "," <type> <params2> ")" "->" <type>
<fcp>      ::= <opt_size>
           |   "->" <type>
<opt_size> ::= ""
           |   "[" INTEGER_LITERAL "]"
<params2>  ::= ""
           |   "," <type> <params2>

which is LL(1). I'll leave it as an exercise to reattach all the appropriate actions to these productions.