
Which contemporary computer languages are LL(1)?


(I am spending the holiday time on some language theory. Excuse me if this is a naive question.)

According to Wikipedia:

LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason.

So, out of curiosity, which contemporary computer languages are LL(1)? Do C, Java, C#, or Python fall into this category?


Solution

  • I think I'd be tempted to flag that Wikipedia quote with [citation needed]; the assumptions are, at least, questionable. For example, there are a large number of compiler-construction tools based on yacc which make it extremely easy, in practice, to construct a parser using the more powerful (and equally fast) LALR algorithm, and some also implement the even more powerful (and, for most practical grammars, almost as fast) GLR algorithm. Hand-writing parsers has not been necessary for decades. [Note 1]

    By way of attempting an answer to the question:

    1. Most computer languages are "technically" not LL because they are not even context-free. That includes languages which require identifiers to be declared (C, C++, C#, Java, etc.), languages with preprocessors and/or macro facilities (C and C++, among others), and languages with ambiguities which can only be resolved using semantic information (Perl would be the worst offender here, but C and C++ are also right up there). And, just to spread the joy around some more, it also includes languages with Python-like layout awareness (Python, of course, and also Haskell).

      I put scare quotes around "technically" because there are a lot of people who will say that these exceptions "don't count". That is their opinion, and they are entitled to it (and anyway I've given up arguing about it, although I don't share that opinion). You could, for example, eliminate the preprocessor from C/C++ by preprocessing the text before applying the grammar, or preprocess whitespace-aware languages by inserting INDENT, NEWLINE and DEDENT tokens instead of the whitespace, after which you could make some kind of claim about a mystical "core language". (That is much harder to apply to C++ templates, which can only be eliminated by parsing the text, so you can't claim that the expansion can be done prior to parsing.)

      The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial. In some languages (not C/C++, in which the semantic analysis is required to avoid ambiguity), you could argue that a parse tree could be constructed without validating the declaration rule, which makes that rule "not syntactic". But it remains the case that you can enforce the declaration rule syntactically; you just cannot do it with a context-free grammar (you could use a Van Wijngaarden grammar, for example).

      In any case, a common parsing strategy is to recognize a superset of a language, and then reject non-conforming programs through a subsequent analysis of the resulting parse tree; in that case, the question becomes whether or not the superset is LL. In my opinion, in order for that to be meaningful, it is necessary that every valid program can be parsed to the correct syntax tree, which eliminates the usage of semantic analysis to disambiguate.

    2. The goal of parsing is to produce a syntax tree, not solely to recognize whether a text is valid. (You might miss this important fact if you skim formal-language textbooks, which tend to focus on recognition, possibly because the construction of syntax trees is less interesting.) So it seems reasonable to insist that a proposed grammar actually represent the syntactic structure of the language.

      For example, you can recognize algebraic expressions using a simple regular language:

      (PREFIX* OPERAND POSTFIX*) (INFIX PREFIX* OPERAND POSTFIX*)*
      

      where PREFIX is the set of prefix operators as well as (, POSTFIX is the set of postfix operators (if any) as well as ), INFIX is the set of infix operators (addition, subtraction, multiplication, etc.), and OPERAND is an identifier or constant.

      That regular expression will not correctly reject expressions with unbalanced parentheses, but we already agreed that it was OK to recognize a superset of the language, right? :-)
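      To see that in action, here is a minimal Python sketch of that recognizer. The particular symbol sets are hypothetical choices for illustration only: '~' as the sole prefix operator, '!' as the sole postfix operator, '+-*/' as the infix operators, and single-letter identifiers.

      import re

      PREFIX  = r"[~(]"                  # prefix operators, plus '('
      POSTFIX = r"[!)]"                  # postfix operators, plus ')'
      INFIX   = r"[-+*/]"                # infix operators
      OPERAND = r"(?:[a-z]|[0-9]+)"      # an identifier or a constant

      UNIT = rf"(?:{PREFIX})*{OPERAND}(?:{POSTFIX})*"
      EXPR = re.compile(rf"{UNIT}(?:{INFIX}{UNIT})*$")

      print(bool(EXPR.match("~a+(b*c)!")))  # True
      print(bool(EXPR.match("a+b)")))       # True: unbalanced parens accepted
      print(bool(EXPR.match("a+*b")))       # False: two infix operators in a row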

      If desired, we could remove the parentheses from the PREFIX and POSTFIX sets and make OPERAND a recursive production. The resulting grammar is trivially LL(1) [Note 2]:

      operand    => IDENTIFIER | CONSTANT | '(' expression ')'
      term       => operand | PREFIX operand | operand POSTFIX
      expression => term | term INFIX expression
      
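      A hand-rolled recursive-descent parser for that grammar fits in a few lines. Here is a Python sketch under the same toy conventions as before (single-character tokens, with '~' and '!' standing in for PREFIX and POSTFIX, and '+-*/' for INFIX); the single-token lookahead at toks[0] is what "LL(1)" means in practice:

      # operand => IDENTIFIER | CONSTANT | '(' expression ')'
      def parse_operand(toks):
          if toks[0] == '(':
              toks.pop(0)                    # consume '('
              tree = parse_expression(toks)
              assert toks.pop(0) == ')'      # consume ')'
              return tree
          return toks.pop(0)                 # IDENTIFIER or CONSTANT

      # term => operand | PREFIX operand | operand POSTFIX
      def parse_term(toks):
          if toks[0] == '~':
              return (toks.pop(0), parse_operand(toks))
          tree = parse_operand(toks)
          if toks and toks[0] == '!':
              return (toks.pop(0), tree)
          return tree

      # expression => term | term INFIX expression
      def parse_expression(toks):
          tree = parse_term(toks)
          if toks and toks[0] in '+-*/':
              return (toks.pop(0), tree, parse_expression(toks))
          return tree

      print(parse_expression(list("a-b-c")))
      # ('-', 'a', ('-', 'b', 'c')) -- right-associated, as discussed next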

      The problem, however, is that this grammar does not capture operator precedence. It does not even attempt to recognize the fact that a+b*c and a*b+c are both sums, so that the top-level operator is + in both cases, and not whatever operator happens to come first in the expression. (If you were parsing APL, this wouldn't be an issue. But most languages conform to the usual expectations about operator precedence.)

      This point is important because an LL grammar cannot handle left-recursive productions, and you need a left-recursive production in order to correctly parse a left-associative operator. (That is, to correctly parse a-b-c as ((a-b)-c) rather than (a-(b-c)), which would have a different value.) Again, you could argue that this is a quibble, since you could post-process the parse tree in order to correct the associativity. This is certainly true, but the result is that the grammar you use to parse is different from the grammar you use to explain the syntax of the language. This might or might not bother you.
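      In code, the usual dodge is to turn the left-recursive production into a loop and fold the tree leftwards by hand, which yields a left-associative tree without any left recursion in sight. A minimal sketch, under the same toy conventions:

      # expression => term ('-' term)*   -- iteration instead of left recursion
      def parse_left_assoc(toks):
          tree = toks.pop(0)                 # a term (just an operand here)
          while toks and toks[0] == '-':
              op = toks.pop(0)
              right = toks.pop(0)
              tree = (op, tree, right)       # fold to the left
          return tree

      print(parse_left_assoc(list("a-b-c")))
      # ('-', ('-', 'a', 'b'), 'c')   i.e. ((a-b)-c), as desired

      Of course, the looping parser no longer corresponds rule-for-rule to the grammar you would publish, which is exactly the trade-off described above.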

      It's worth adding here that there are languages (Haskell, Prolog, probably many more) in which you can define operators and their precedence in the program text. That obviously makes it impossible to generate a correct syntax tree without semantic analysis, and the usual approach to parsing such languages is to use precisely the simplified "alternating operand and operator" language outlined above. Once the operator precedences are all known, presumably at the end of the parse, all expressions are then reanalyzed using something like the Shunting Yard Algorithm, in order to produce the correct parse.
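      For a flavour of that reanalysis, here is a toy Shunting Yard pass in Python. The token list, the precedence table, and the tuple trees are all hypothetical simplifications; in a Haskell-like language the table would only be known after reading the whole program.

      def shunting_yard(tokens, prec, left_assoc):
          out, ops = [], []                  # operand stack, operator stack
          def reduce_top():
              op = ops.pop()
              r, l = out.pop(), out.pop()
              out.append((op, l, r))
          for tok in tokens:
              if tok in prec:                # an infix operator
                  while ops and (prec[ops[-1]] > prec[tok] or
                                 (prec[ops[-1]] == prec[tok] and left_assoc[tok])):
                      reduce_top()
                  ops.append(tok)
              else:                          # an operand
                  out.append(tok)
          while ops:
              reduce_top()
          return out[0]

      print(shunting_yard(['a', '+', 'b', '*', 'c'],
                          prec={'+': 1, '*': 2},
                          left_assoc={'+': True, '*': True}))
      # ('+', 'a', ('*', 'b', 'c')) -- '*' binds tighter, and the tree shows it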

    3. Let's suppose we discard the above objections, and just ask "for which commonly used programming languages might an LL parser be used?"

      In order to avoid cheating, though, we should really require that the LL parser have fixed lookahead and not require backtracking. If you allow arbitrary lookahead and backtracking, then you considerably expand the domain of parseable languages, but I contend that you are no longer in the realm of LL. That will eliminate, for example, the template- and preprocessor-free subset of C++, even though the common compiler implementations use recursive descent parsers, since backtracking is required in order to resolve the "Most Vexing Parse" ambiguity. (Anyway, you can't really remove templates from C++, and parsing with templates is really hard. [Note 3])

      Java and Python were both designed to be LL(1) "pseudo-parseable". I believe Haskell falls into this category as well. C cannot be correctly parsed without semantic information (classic ambiguity: is (x)*(y) a cast or a multiplication? -- it depends on whether x has been typedef'ed or declared as a variable) but I'm pretty sure it can be captured with a non-backtracking recursive-descent parser. I haven't looked at C# in depth, but this answer by Eric Lippert suggests that it requires backtracking in some cases.
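      To make the C ambiguity concrete, here is a toy sketch of the classic "lexer hack": the parser consults a table of typedef names -- semantic information gathered earlier -- to classify the token sequence. Everything here is a hypothetical simplification.

      # Is '(' x ')' '*' '(' y ')' a cast of a dereference, or a product?
      def classify(tokens, typedef_names):
          if tokens[0] == '(' and tokens[1] in typedef_names:
              return "cast: (x) applied to *(y)"
          return "multiplication: (x) times (y)"

      toks = ['(', 'x', ')', '*', '(', 'y', ')']
      print(classify(toks, typedef_names={'x'}))  # cast: (x) applied to *(y)
      print(classify(toks, typedef_names=set()))  # multiplication: (x) times (y)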

    Notes

    1. Of course, people still do it, and in many cases for good reasons (producing better error messages, for example). But "it's difficult to construct an LALR parser" is not a good reason, since it's not.

    2. That's not quite LL, because I didn't left-factor. But it's easy enough to do; I'll leave it as an exercise.
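       For the record, here is one possible left-factoring, in the same notation as above (try it yourself before peeking):

       operand     => IDENTIFIER | CONSTANT | '(' expression ')'
       term        => PREFIX operand | operand opt_postfix
       opt_postfix => POSTFIX | ε
       expression  => term opt_infix
       opt_infix   => INFIX expression | ε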

    3. See "Is C++ context-free or context-sensitive?"; see also Todd Veldhuizen's classic paper "C++ Templates are Turing Complete".