context-free-grammar ebnf left-recursion

Preferred way of removing left recursion of EBNF

Many textbooks propose different techniques of removing left recursion. Some of them alter the associativity, which is something I would like to avoid.

What is the most straightforward way to remove left recursion in the following (example) grammar?

logical_or_expression
    = logical_and_expression
    | logical_or_expression , "||" , logical_and_expression
    ;

The example is written in extended backus-naur form and taken from the C grammar of Annex A in the C11 standard. This is done for the sake of transforming the grammar into EBNF for recursive-descent parsing.

My take on the problem:

logical_or_expression
    = logical_and_expression, { "||" , logical_and_expression }
    ;

But does this preserve the left-associativity of logical_or_expression?

Solution

I suppose you meant

logical_or_expression
    = { logical_and_expression, "||" } , logical_and_expression
    ;

Because otherwise, you haven't removed the left-recursion at all. But that is not really suitable for recursive parsing either, since you cannot tell from the first token of the logical_or_expression whether or not the repeated clause should be predicted. That's why the usual style is

logical_or_expression
    = logical_and_expression, { "||" , logical_and_expression }
    ;

Unfortunately, it is reasonable to say that the above does not preserve left-associativity. However, there is also a sense in which the { ... } EBNF syntax fails to define associativity at all. The first interpretation (that the syntax is inherently right-associative) comes from the special status of the first logical_and_expression; the second interpretation (that the syntax fails to specify associativity) comes from the lack of defined ordering of the repetition operator. [Note 1]

Personally, I prefer adding an "interpolation" (or "separated repetition") syntax, which would allow a simpler expression of separated lists:

logical_or_expression
    = { logical_and_expression // "||" }
    ;

Or:

argument_list
    = { expression // "," }
    ;

There's no inherent reason not to define an interpolation operator, and some varieties of augmented BNF do have such a thing. [Notes 2&3] This still doesn't clearly specify associativity, but it has two advantages: (1) it's shorter, and (2) it doesn't artificially privilege the first element of the list.

When repetition (of any form) is translated back into BNF, a decision needs to be made as to whether the repetition is left- or right-recursive, which is also a decision about whether the repetition is left- or right-associative. If one were to translate a repetition operator into BNF prior to building a recursive-descent parser, there would be no option other than to use the right-{recursive/associative} version. However, the actual code in a recursive descent parser could work either way:

# Recognize A_list == { A // sep }
# Iterative version, left associative
def parse_A_list():
  x = parse_A()
  while next() == sep:
    accept(sep)
    x = new_A_list(x, parse_A())

// Recursive version, right associative
def parse_A_list():
  x = parse_A()
  if next() == sep:
    accept(sep)
    x = new_A_list(x, parse_A_list())
  return x

Notes

In Modern Programming Languages, 2d Edition, Chapter 3, page 40, the author suggests that you could:

Define a convention: for example, that the form
<exp> ::= <mulexp> { + <mulexp>}
will be used only for left-associative operators

although he also suggests the strategy of explicitly describing associativity in an associated paragraph of English text.

I don't know of any commonly-used formalism which uses the first convention (and I find it unnatural) but it at least illustrates the point that EBNF by itself does not adequately define associativity of repetition.
For example, the Syntax Definition Format (SDF) uses braces to indicate a separated repetition; the last element inside the brace must be a non-terminal and is used as the separator. You still need a repetition operator, either * or +, so what I wrote as { A // a } would be written in SDF as { A a }+, whereas I would write there { A a }* as [{ A // a }]. These differences are not particularly relevant.
As another example, W3C uses (or used at some time) a variant of the ABNF (RFC 5234) formalism in which #X indicates a list of Xs separated specifically by commas. (And also n#mX for numbers n and m indicating the minimum and maximum number of repetitions.)