Can Lex and Yacc, together, lex and parse both Lex and Yacc?
In other words, is it possible to write a Lex/Yacc combination that is self hosting, generating their own parsers?
EDIT: I do not mean that the combination needs to parse the C parts of the input fully. In particular, it does not need to deal with the typedef-name/identifier conflict, or build a full symbol table (though I believe that both can be dealt with with C code in the lexer and parser). This is because the C code is copied verbatim to the output.
EDIT 2: Basically, do Lex and Yacc (the languages, not the programs) have LALR(1) grammers?
Well, the answer is actually more complex than what the first answers reported. I claim that, no, YACC cannot parse YACC, but evil is in the details.
The natural grammar for YACC includes:
gram: %empty | gram rule
rule: ID ":" rhs
rhs: %empty | rhs ID
(using Bison-like conventions: ":"
is a valueless token, ID
is the token for identifiers, and %empty
emphasizes rules with an empty RHS).
It's not hard to see that this grammar is not LR(1). Being LR(1) roughly means that when your cursor advances, you know in which rule you are just by looking at the next token. Then consider the following (valid) input:
ID : ID ID
ID : ID
ID :
can you spot easily when starts a rule and when begins the next one? Well, too easy this way, try to find out as YACC would see it:
ID : ID ID ID : ID ID :
How do you know when the first rule is finished? Consider the position of the cursor as a .
below:
ID : ID . ID ID : ID ID :
Is the first rule finished? Of course not. And what about the next step?
ID : ID ID . ID : ID ID :
Is the first rule finished? Yes obviously. But in both case the next token (aka the "lookahead") is ID
; it is the following token which helps deciding whether the current rule is finished (when ":"
) or not (ID
). In other words 1 lookahead is not enough, this grammar is not LR(1), and therefore not LALR(1) either (which is the class of grammars that YACC accepts). Clearly it is LR(2).
It would be extremely easy to make this grammar LR(1) had we accepted to change the language and require a terminating ;
for rules:
ID : ID ID ; ID : ID ; ID : ;
unfortunately it's too late and POSIX did not decide to make this semicolon mandatory, so implementations of YACC must deal with this LR(2) grammar.
In the case of Bison, it uses a neat/dirty (depends on your viewpoint) trick to deal with it: the scanner is taught to make the difference between a "regular" identifier (ID) and an identifier followed by a colon (ID_COLON). Then the flow of tokens reads
ID_COLON ID ID ID_COLON ID ID_COLON
and this is obviously LR(1).
So why wasn't the semicolon required from the inception of YACC? Well, for obvious bootstrapping reasons YACC cannot be written first in YACC, so I guess S. Johnson never realized that the parser he wrote by hand actually accepted a grammar which is harder than LR(1). Bison and byacc come from a common clone whose parser was also written by hand. Bison's own parser was bootstrapped much later, in 2002 (commit e9955c83734d0a545d7822a1feb9c4a8038a62cb).
Much of what I have written can be debated depending on how you see things. I claim the "natural" grammar is not LR(1), but I don't think one may define "natural" here. So yes, others may say that the grammar of YACC is LR(1). But in the end, we are all right, since if a language is LR(k), then it is LR(1) (i.e., an LR(k) grammar can be turned into a LR(1) monster grammar, which is proved using a similar dirty/neat trick).
Wrt Flex, well, yes, it is bootstrapped, but it's also somewhat untrue to say so: it overall syntax is really simple and easily described by regexps. However the regexps themselves escape the realm of mere regular languages: there are parentheses! So somewhere inside Lex/Flex, there must be a real parser, be it generated or hand written, to parse the regexps.
Asking if Lex is bootstrapped is similar to asking whether the C preprocessor is bootstrapped: yes it is, I'm sure there are #include
s in there :)