Search code examples
parsinggrammaryaccformal-languages

What is the class of languages that can be parsed with yacc?


I recently learned that C does not have a context-free grammar. I also recently learned that gcc used to use yacc to parse C. The manual for the the yacc utility states "The class of specifications accepted [by yacc] is a very general one: LALR(1) grammars with disambiguating rules", while Wikipedia states that LALR grammars are a subset of deterministic context-free grammars, which are a subset of the context-free grammars. If C is not even context-free (much less a deterministic context-free language), and yet yacc can parse C, then what class of languages can yacc parse, if not the subset of context-free languages that have LALR(1) grammars?


Solution

  • Yacc generates computer programs, which are pretty well Turing complete. The yacc framework uses an LALR(1) framework to trigger actions, but the actions are arbitrary code.

    Moreover, the input to yacc is a stream of tokens, not a direct input. The stream of tokens is produced by another computer program written in a Turing complete language, which can also manipulate its input in ways not restricted to context-free transcoding.

    Finally, nothing stops a yacc-generated parser from initially accepting a superset of the intended language and then later analysing the context-free parse tree and rejecting certain constructs based on arbitrary computations, such as insisting that variables be declared before use (a context-sensitive computation).

    In short, real-world parsers are pragmatically written programs, not theoretical academic exercises. Languages parsed by bison/yacc are generally "mostly" LALR(1), and their lexical analysis is generally "mostly" regular, but don't be surprised when the computer programs exploit their full power to transcend these limits.

    That's what makes programming the interesting activity it is.

    None of that makes the academic theory less useful. Bison/yacc and other parser generators take a lot of the gruntwork out of building parsers, because they can handle "most of" the analysis. And the closer a language comes to the analysable context-free model, the easier it is to generate ("most of") other useful tools: linters, syntax highlighters, reformatters, indexers, static analysers, documentation extractors, etc., etc. To say nothing of functioning as documentation for the syntax of the language itself.