
Syntax fragment (include or import) in YACC


Is it possible to include/import yacc fragment files from different files to a main YACC?

Just to exemplify what I'm looking for, I would like to create 3 syntax parsers for 3 different files, but they'd share a common syntax block.

So, I'd like to keep this syntax in only one yacc fragment file, so I could maintain it better.


Solution

  • Recent versions of GNU Bison support multiple grammar start symbols. Thus you can define one grammar file that has multiple languages in it, which share some common syntax rules.

    Using some external preprocessing tool, you could include common grammar material in multiple Yacc files. This kind of inclusion is not directly supported in any Yacc-like program I know. Of course, the C #include is supported, but that's passed through to the C code, not processed by Yacc.

    In any Yacc, you can simulate multiple start symbols. Add an alternative to your top-level rule for each sub-language, and begin each alternative with a secret token that selects it.

    You need a YYINPUT operator that lets you stuff these secret tokens between your parser and scanner to "prime" the scan so that the parser will see the secret tokens and recognize the rule headed by them. When the parser calls yylex(), it has to first obtain the secret tokens that have been injected; when those are exhausted, then call the real scanner.

    Secret tokens are abstract token values that are not produced by the scanner; they are purely internal.
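
    As a rough illustration of the injection step, here is a minimal sketch in C. It assumes a single pending secret token and uses a toy scanner in place of a Lex-generated one. Apart from SECRET_ESCAPE_R, every name here (real_yylex, prime_for_regex, TOK_CHAR, primed_tok) is hypothetical, and the token values are arbitrary.

    ```c
    /* Hypothetical token codes. In a real project Yacc generates these from
       %token declarations; yylex returns 0 at end of input. */
    enum { TOK_EOF = 0, TOK_CHAR = 258, SECRET_ESCAPE_R = 259 };

    /* The grammar side would look roughly like this (assumed shape):
     *
     *   top : SECRET_ESCAPE_R regex
     *       | program
     *       ;
     */

    static const char *scan_pos = "";  /* remaining input for the toy scanner */
    static int primed_tok = 0;         /* pending secret token, 0 if none */

    /* Stand-in for the real scanner: one TOK_CHAR per input byte. */
    int real_yylex(void)
    {
        if (*scan_pos == '\0')
            return TOK_EOF;
        scan_pos++;
        return TOK_CHAR;
    }

    /* What the parser calls: deliver the injected token first,
       then fall through to the real scanner. */
    int yylex(void)
    {
        if (primed_tok != 0) {
            int tok = primed_tok;
            primed_tok = 0;
            return tok;
        }
        return real_yylex();
    }

    /* Entry point for the regex subgrammar: set the input and prime
       the token stream with the secret token. */
    void prime_for_regex(const char *text)
    {
        scan_pos = text;
        primed_tok = SECRET_ESCAPE_R;
    }
    ```

    After prime_for_regex("ab"), the parser would see SECRET_ESCAPE_R followed by the ordinary tokens for "ab", so only the alternative headed by that token can match.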

    You can see the technique in this grammar file; look for rules that contain SECRET_ESCAPE_R and other similar terminal symbols.

    In SECRET_ESCAPE_R, the R stands for regex; the entry point it creates is used for parsing a regular expression. In this parser.c file, there is a regex_parse function which calls parse with the enum value prime_regex as an argument. This prime_regex value tells parse to prepare the SECRET_ESCAPE_R token. parse itself is found back in the grammar file, toward the bottom. It uses the helpers prime_parser and prime_parser_post, which are again found in parser.c.

    The priming mechanism in this project handles not only parsing subgrammars, but also the parsing of multiple expressions (or other units) from the same stream. This is also something not supported nicely by Lex and Yacc "out of the box".

    In certain languages, when you try to read a single expression (or definition, declaration or other unit) of a language, the Yacc parser will read ahead by one token. (Pesky LALR(1) likes to do that.) In other words, in extracting one expression, it ends up consuming a token of the following one.

    The prime_parser function notices that there was a previous parse which finished with a certain lookahead token. That token is pushed back into the stream first, and then any secret tokens which will guide the parse to a desired subgrammar.
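
    The ordering described above can be sketched as follows. This is not the project's code: a small hand-written loop stands in for the Yacc-generated parser, and the struct, the token names other than the SECRET_ prefix, and the helper signatures are all assumptions made for illustration. A unit here is WORD ('+' WORD)*, so the parser can only tell that a unit has ended by reading the first token of the next one.

    ```c
    /* Hypothetical token codes; values are arbitrary. */
    enum { TOK_EOF = 0, TOK_WORD = 258, TOK_PLUS = 259, SECRET_ESCAPE_E = 260 };

    struct parser {
        const char *pos;   /* remaining input */
        int pushed[4];     /* pushback stack, last pushed is returned first */
        int npushed;
        int recent_tok;    /* lookahead eaten by the previous parse, 0 if none */
    };

    void parser_init(struct parser *p, const char *text)
    {
        p->pos = text;
        p->npushed = 0;
        p->recent_tok = 0;
    }

    /* Toy scanner: '+' is TOK_PLUS, any other run of non-space bytes is a word. */
    static int scan(struct parser *p)
    {
        while (*p->pos == ' ')
            p->pos++;
        if (*p->pos == '\0')
            return TOK_EOF;
        if (*p->pos == '+') {
            p->pos++;
            return TOK_PLUS;
        }
        while (*p->pos != '\0' && *p->pos != ' ' && *p->pos != '+')
            p->pos++;
        return TOK_WORD;
    }

    static void pushback_token(struct parser *p, int tok)
    {
        if (p->npushed < 4)
            p->pushed[p->npushed++] = tok;
    }

    /* Pushed-back tokens take priority over the scanner. */
    static int next_token(struct parser *p)
    {
        if (p->npushed > 0)
            return p->pushed[--p->npushed];
        return scan(p);
    }

    /* Push the leftover lookahead first, then the secret token, so the
       secret token comes out first and the lookahead right after it. */
    void prime_parser(struct parser *p, int secret)
    {
        if (p->recent_tok != 0) {
            pushback_token(p, p->recent_tok);
            p->recent_tok = 0;
        }
        pushback_token(p, secret);
    }

    /* Stand-in for yyparse: reads one unit and returns its word count,
       or -1 if no unit could be read. */
    int parse_one(struct parser *p)
    {
        int words = 1;
        prime_parser(p, SECRET_ESCAPE_E);
        if (next_token(p) != SECRET_ESCAPE_E || next_token(p) != TOK_WORD)
            return -1;
        for (;;) {
            int tok = next_token(p);
            if (tok != TOK_PLUS) {
                p->recent_tok = tok;  /* belongs to the next unit: remember it */
                return words;
            }
            if (next_token(p) != TOK_WORD)
                return -1;
            words++;
        }
    }
    ```

    With the input "a+b c+d+e f", three successive calls to parse_one return 2, 3 and 1. Without the recent_tok step, the first word of the second and third units would be lost, because the previous call had already consumed it as lookahead.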