I am facing a problem with rules that with long look ahead
Take as an example a parser that consumes integer or fractions, also the fractions have no GAP
s in the numerator (however it may have GAP
s between the numerator and the SLASH
)
The regular expression [0-9_ ]*?([0-9]+\/[0-9_ ]+)|[0-9_ ]+
describes the valid inputs
you can check some examples here.
Here is one way to write it
Value: Integer | Fraction;
Fraction: IntegerTokenStar DigitPlus GapStar SLASH IntegerToken
DigitPlus: DIGIT DigitPlus | DIGIT
GapStar: GAP GapStar | %empty
Integer: IntegerTokenPlus
IntegerToken: DIGIT | GAP
IntegerTokenStar: IntegerToken IntegerTokenStar | %empty
IntegerTokenPlus: IntegerToken IntegerTokenPlus | IntegerToken
But it will fail to parse even an example like 0 0/0
, the IntegerTokenStar will consume as much as it can, then trying to parse the numerator there is no digit available, trying to continue with integer is not possible because it has a '/'.
How to write this in a conceptually clear way and that we can produce a valid parser.
A few strings and the expected (i)nteger part, (n)umerator, (d)enominator.
1_1_ 1___/1_1 -> fraction {i:"1_1_ ",n:"1___", d:"1_1"}
1_1_ 1___1_1 -> integer {i:"1_1_ 1___1_1",n:"", d:""}
1_1_1___/1_1 -> fraction {i:"",n:"1_1_1___",d:"1_1"}
%define parse.error verbose
%locations
%{
void yyerror(const char* s);
extern int yylex();
extern int yylineno;
extern int yycolumn;
#include <stdio.h>
#include <stdlib.h>
%}
%token DIGIT SLASH GAP NEWLINE
%start File
%%
File: Value | Value NEWLINE File
Value: Integer | Fraction;
Fraction: IntegerTokenStar DigitPlus SLASH IntegerToken
DigitPlus: DIGIT DigitPlus | DIGIT
Integer: IntegerTokenPlus
IntegerToken: DIGIT | GAP
IntegerTokenStar: IntegerToken IntegerTokenStar | %empty
IntegerTokenPlus: IntegerToken IntegerTokenPlus | IntegerToken
%%
int main(){
yyparse();
return 0;
}
void yyerror(const char* s) {
fprintf(stderr, "Line %d: %s\n", yylineno, s);
exit(1);
}
%option noyywrap yylineno
%{
#include <stdio.h>
#include "frac.tab.h"
#define YY_DECL int yylex()
int yycolumn = 1;
#define YY_USER_ACTION yylloc.first_line = yylloc.last_line = yylineno; \
yylloc.first_column = yycolumn; yylloc.last_column = yycolumn + yyleng - 1; \
yycolumn = yytext[0] == '\n' ? 1: yycolumn + yyleng;
%}
%%
[\n] {return NEWLINE;}
[_ ] {return GAP;}
[0-9] {return DIGIT;}
"/" {return SLASH;}
%%
frac: frac.yy.c frac.tab.c
gcc frac.tab.c frac.yy.c -o frac
frac.yy.c: frac.l
flex -o frac.yy.c frac.l
frac.tab.c frac.tab.h: frac.y
bison -d frac.y
The basic problem is that you've split digit sequences with gaps and digit sequences without gaps into two independent rules, which means you need to decide which you're going to match first, which requires (possibly unbounded) lookahead to decide which to match.
The solution is generally to match tokens "bottom up" -- single rule for each thing independent of context, and build up lookahead-dependent things from that. In you case, that means building up IntegerToken
from DigitStar
rather than from DIGIT
directly -- an input of digits will be recognized as a DigitStar
and only when you get to the end of it (and see the non-digit) do you need to decide what it is.
The problem is that the obvious fix for your grammar (changing IntegerToken: DIGIT | GAP
to DigitStar | GAP
) doesn't work because it makes IntegerTokenStar
(and -Plus
) ambigous as any sequence of 2 or more digits might be any number of DigitStar
tokens. So you need to rewrite this to make sure you can't have two consecutive DigitStar
tokens, which turns out to be quite tricky. Your really need to rethink things "bottom up" -- the input is a sequence of alternating numbers (1+ digits each) and gaps (1+ spaces), with an optional single /
that can appear directly between two numbers (no gaps) or a number and a gap (no gap before the /
). So you get rules that look more like:
File: Value | Value NEWLINE File
Value: OptGap Integer OptGap | Fraction ;
Fraction: OptGap Integer GapPlus DigitPlus SLASH OptGap Integer OptGap
| OptGap DigitPlus SLASH OptGap Integer OptGap
DigitPlus: DIGIT DigitPlus | DIGIT
GapPlus: GAP GapPlus | GAP
OptGap: %empty | GapPlus
Integer: Integer GapPlus DigitPlus | DigitPlus
This does the trick, but is needlessly complex as it recognizes 'number' and 'gap'1 tokens in the grammar rather than the lexer. There's also the odd corner case of disallowing a gap between the numerator and the /
in a fraction -- without that we could just ignore spaces (gaps) in the lexer and make things much simpler:
File: Value | Value NEWLINE File
Value: Integer | Fraction ;
Fraction: Integer NUMBER SLASH Integer | NUMBER SLASH Integer
Integer: Integer NUMBER | NUMBER
1Side note -- your lexer seems to disagree with your examples as it treats _
as a GAP token rather than a DIGIT token.
But then your examples don't match your regex -- having a _
immediately before a /
will not match