I'm writing a parser with error handling. I would like to output to the user the exact location of the parts of the input that couldn't be parsed.
However, the location of the error token always starts at 0, even when the input before it was parsed successfully.
Here's a heavily simplified example of what I did. (The problematic part is probably in parser.yy.)
Location.hh:
#pragma once
#include <string>
// The full version tracks position in bytes, line number and offset in the current line.
// Here however, I've shortened it to line number only.
struct Location
{
    int beginning, ending;
    operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};
LexerClass.hh:
#pragma once
#include <istream>
#include <string>
#if ! defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif
#include "Location.hh"
class LexerClass : public yyFlexLexer
{
    int currentPosition = 0;
protected:
    std::string *yylval = nullptr;
    Location *yylloc = nullptr;
public:
    LexerClass(std::istream &in) : yyFlexLexer(&in) {}
    [[nodiscard]] int yylex(std::string *const lval, Location *const lloc);
    void onNewLine() { yylloc->beginning = yylloc->ending = ++currentPosition; }
};
lexer.ll:
%{
#include "./parser.hh"
#include "./LexerClass.hh"
#undef YY_DECL
#define YY_DECL int LexerClass::yylex(std::string *const lval, Location *const lloc)
%}
%option c++ noyywrap
%option yyclass="LexerClass"
%%
%{
    yylval = lval;
    yylloc = lloc;
%}
[[:blank:]] ;
\n { onNewLine(); }
[0-9] { return yy::Parser::token::DIGIT; }
. { return yytext[0]; }
parser.yy:
%language "c++"
%code requires {
#include "LexerClass.hh"
#include "Location.hh"
}
%define api.parser.class {Parser}
%define api.value.type {std::string}
%define api.location.type {Location}
%parse-param {LexerClass &lexer}
%defines
%code {
    template<typename RHS>
    void calcLocation(Location &current, const RHS &rhs, const int n);
    #define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
    #define yylex lexer.yylex
}
%token DIGIT
%%
numbers:
    %empty
  | numbers number ';' { std::cout << std::string(@number) << "\tnumber" << std::endl; }
  | error ';' { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
  ;
number:
    DIGIT {}
  | number DIGIT {}
  ;
%%
#include <iostream>
template<typename RHS>
inline void calcLocation(Location &current, const RHS &rhs, const int n)
{
    current = (n <= 1)
        ? YYRHSLOC(rhs, n)
        : Location{YYRHSLOC(rhs, 1).beginning, YYRHSLOC(rhs, n).ending};
}
void yy::Parser::error(const Location &location, const std::string &message)
{
    std::cout << std::string(location) << "\terror: " << message << std::endl;
}
int main()
{
    LexerClass lexer(std::cin);
    yy::Parser parser(lexer);
    return parser();
}
For the input:
123
456
789;
123;
089
xxx
123;
765
432;
expected output:
0-2 number
3-3 number
5-5 error: syntax error
4-6 error context
7-8 number
actual output:
0-2 number
3-3 number
5-5 error: syntax error
0-6 error context
7-8 number
I'm building upon rici's answer, so read that one first.
Let's consider the rule:
numbers:
    %empty
  | numbers number ';'
  | error ';' { yyerrok; }
  ;
This means the nonterminal numbers can be one of these three things:
- nothing at all (%empty),
- a number preceded by any valid numbers,
- an error.
Do you see the problem yet?
The whole numbers has to be an error, from the beginning; there is no rule saying that anything else is allowed before it.
Of course Bison obediently complies with your wishes and makes the error start at the very beginning of the nonterminal numbers.
It can do that because error is a jack of all trades and there can be no rule about what can be included inside of it. Bison, to fulfill your rule, needs to extend the error over all previous numbers.
Once you understand the problem, fixing it is rather easy. You just need to tell Bison that numbers are allowed before the error:
numbers:
    %empty
  | numbers number ';'
  | numbers error ';' { yyerrok; }
  ;
Allowing numbers before the error like this is IMO the best solution. There is another approach, though.
You can move the error token into number:
numbers:
    %empty
  | numbers number ';' { yyerrok; }
  ;
number:
    DIGIT
  | number DIGIT
  | error
  ;
Notice that yyerrok needs to stay in numbers because the parser would enter an infinite loop if you place it next to a rule that ends with the token error.
A disadvantage of this approach is that if you place an action next to this error, it will be triggered multiple times (more or less once for every illegal terminal).
In some situations this may be preferable, but generally I suggest the first way of solving the issue.