I am trying to use flex and bison to create a compiler for a simple language called "FUNC", but it gives me a segmentation fault, and after spending hours on it I still haven't been able to fix it. I would really appreciate it if you could help. Thanks!
Flex file "func.lex":
%{
#include "tokens.h"
//#include "y.tab.h"
%}
DIGIT [0-9]
IDENT [a-zA-Z][A-Za-z0-9]*
%%
"function" {return FUNCTION;}
"returns" {return RETURNS;}
"begin" {return BEGIN;}
"end" {return END;}
"read" {return READ;}
"write" {return WRITE;}
"if" {return IF;}
"then" {return THEN;}
"else" {return ELSE;}
"variables" {return VARIABLES;}
"while" {return WHILE;}
"loop" {return LOOP;}
"Less" {return LESS;}
"LessEq" {return LESSEQ;}
"Eq" {return EQ;}
"NEq" {return NEQ;}
"(" {return LB;}
")" {return RB;}
"Plus" {return PLUS;}
"Times" {return TIMES;}
"Minus" {return MINUS;}
"Divide" {return DIVIDE;}
"," {return COMMA;}
":=" {return ASSIGN;}
";" {return SEMI;}
{DIGIT}+ {return NUMBER;}
{IDENT} {return NAME;}
<<EOF>> {return EOF;}
[ \t\n]+ /* eat up whitespace */
%%
int yywrap() { return EOF; }
Yacc file "func.y":
%{
//#include "tokens.h"
#include <stdio.h>
#include <stdlib.h>
extern FILE * yyin;
extern char * yytext;
extern int yylex(void);
extern int yyparse();
void yyerror( const char *s);
int yylex(void);
int symb;
%}
/*****************bison declarations**********************/
%union //defining all possible semantic data types (strings and digits)
{
int NUMBER;
char * NAME;
_Bool COND; //return value of conditional expressions. one of our $$ can have value 0 or 1
}
%start program
%type <NUMBER> NUMBER
%type <NAME> NAME
%token FUNCTION RETURNS VARIABLES BEGIN END COMMA SEMI ASSIGN
READ WRITE
IF THEN ELSE
WHILE LOOP
LB RB
LESS LESSEQ EQ NEQ
PLUS MINUS TIMES DIVIDE
NAME NUMBER //same case as that used in "operations" below (same as FUNC syntax)
%%
//grammar rules
program: funcs
; //<program> ::= <funcs>
funcs: func ";" //<funcs> ::= <func>; [<funcs>]
|func ";" funcs
;
func: FUNCTION NAME "("")" BEGIN commands END FUNCTION /*<func> ::= function <name>([<args>])[returns <name>] [variables <args>] begin <commands> end function*/
|FUNCTION NAME "(" args ")" BEGIN commands END FUNCTION
|FUNCTION NAME "(" args ")" RETURNS NAME BEGIN commands END FUNCTION
|FUNCTION NAME "("")" RETURNS NAME BEGIN commands END FUNCTION
|FUNCTION NAME "("")" BEGIN commands VARIABLES args END FUNCTION
|FUNCTION NAME "(" args ")" BEGIN commands VARIABLES args END FUNCTION
|FUNCTION NAME "("")" RETURNS NAME BEGIN commands VARIABLES args END FUNCTION
|FUNCTION NAME "(" args ")" RETURNS NAME BEGIN commands VARIABLES args END FUNCTION
;
args: NAME //<args> ::= <name> [,<args>]
|NAME "," args
;
commands: command ";" //<commands> ::= <command>; [<commands>]
|command ";" commands
;
command: assign //<command> ::= <assign> | <if> | <while> | read <name> | write <expr>
|if
|while
|read
|write
;
assign: NAME ":=" expr {$<NAME>$=$1=$<NUMBER>3;} //<assign> ::= <name> := <expr>
//assign: NAME ASSIGN expr {$1=$3;}
;
if: IF condexpr THEN commands END IF //<if> ::= if <condexpr> then <commands> [else <commands>] end if
|IF condexpr THEN commands ELSE commands END IF
;
while: WHILE condexpr LOOP commands END LOOP
; //<for> ::= while <condexpr> loop <commands> end loop
read: READ NUMBER
|READ NAME
;
write: WRITE expr
;
condexpr: bop "(" expr "," expr ")"
; //<condexpr> ::= <bop> ( <exprs> )
bop: LESS //<bop> ::= Less | LessEq | Eq | NEq
|LESSEQ
|EQ
|NEQ
;
Less: LESS "(" NUMBER "," NUMBER ")" {if($<NUMBER>3<$<NUMBER>5)$<COND>$=1;} ;
LessEq: LESSEQ "(" NUMBER "," NUMBER ")" {if($<NUMBER>3<=$<NUMBER>5)$<COND>$=1;} ;
Eq: EQ "(" NUMBER "," NUMBER ")" {if($<NUMBER>3=$<NUMBER>5)$<COND>$=1;} ;
NEq: NEQ "(" NUMBER "," NUMBER ")" {if($<NUMBER>3!=$<NUMBER>5)$<COND>$=1;} ;
exprs: expr //<expr> [,<exprs>]
|expr "," exprs
;
expr: NAME
|NUMBER
|NAME "(" exprs ")" //<name>[( <exprs> )] | <number>
;
/***************idk if we need this. dunno which file to describe these operations in ***********************/
Plus: PLUS "(" NUMBER "," NUMBER ")" {$<NUMBER>$=$3+$5; } ; //S1=plus $2=( $3=expr $4= $5=expr $6=)
Minus: MINUS "(" NUMBER "," NUMBER ")" {$<NUMBER>$=$3-$5; } ;
Times: TIMES "(" NUMBER "," NUMBER ")" {$<NUMBER>$=$3*$5; } ;
Divide: DIVIDE "(" NUMBER "," NUMBER ")" {$<NUMBER>$=$3/$5; } ;
%%
//c code
/*
int main(int c, char * * argv) {
if ((yyin = fopen(argv[1], "r")) == NULL) {
printf("can't open %s\n", argv[1]);
exit(0);
}
symb = yylex();
yyparse();
// program(1);
fclose(yyin);
}
*/
int main (char * * argv)
{
if ((yyin = fopen(argv[1], "r")) == NULL) {
printf("can't open %s\n", argv[1]);
exit(0);
}
yylex();
}
void yyerror(const char *s)
{
extern int yylineno; // defined and maintained in lex.c
extern char *yytext; // defined and maintained in lex.c
/*std::cerr << "ERROR: " << s << " at symbol \"" << yytext;
std::cerr << "\" on line " << yylineno << std::endl;
exit(1);*/
printf("parse error Message: ", s);
fflush(1);
exit(-1);
}
/*int yyerror(char *s)
{
return yyerror(string(s));
}*/
Your lexical definition has various errors; I don't know the extent to which any of them are contributing to your problem because of the lack of details about the problem, so I will just list them:
%{
#include "tokens.h"
//#include "y.tab.h"
%}
The bison-generated header file contains the definition of YYSTYPE, and it is essential that the parser and the scanner agree on this definition. It also contains the correct definitions of the various tokens, which must also be the same in both files. You don't show the contents of tokens.h, but its use does not inspire confidence; if you did that in order to mask some other problem, fix the other problem before proceeding.
<<EOF>> {return EOF;}
The agreement between the parser and the scanner is that the scanner will return the token id 0 to signal the end of input. The value of EOF is normally -1, which is not a valid token number (token numbers are non-negative integers) and will not be handled correctly by the generated parser. By default, (f)lex inserts an appropriate end-of-file rule which does the right thing, and you should rely on that behaviour.
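To make the end-of-input convention concrete, here is a minimal hand-written stand-in for yylex() in plain C. It is only a sketch: next_token and the token codes are made up for illustration (bison assigns the real codes in its generated header), and a flex-generated scanner does the equivalent for you through its default end-of-file rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>   /* only for the EOF macro mentioned in the comments */

/* Hypothetical token codes for illustration; the real ones come from
   the bison-generated header. 0 is reserved for end of input. */
enum { END_OF_INPUT = 0, NUMBER = 258, NAME = 259 };

/* Hand-written stand-in for yylex(): scans the string at *cursor,
   advances *cursor past the token, and returns the token code. */
static int next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;

    while (isspace((unsigned char)*p))      /* skip whitespace */
        p++;

    if (*p == '\0') {                       /* end of input: return 0, */
        *cursor = p;                        /* never EOF (usually -1)  */
        return END_OF_INPUT;
    }
    if (isdigit((unsigned char)*p)) {       /* {DIGIT}+ */
        while (isdigit((unsigned char)*p))
            p++;
        *cursor = p;
        return NUMBER;
    }
    if (isalpha((unsigned char)*p)) {       /* [a-zA-Z][A-Za-z0-9]* */
        while (isalnum((unsigned char)*p))
            p++;
        *cursor = p;
        return NAME;
    }
    *cursor = p + 1;                        /* any other character is */
    return (unsigned char)*p;               /* returned as its own code */
}
```

Calling next_token repeatedly on a string yields one code per token and then 0 for every call after the input is exhausted, which is what the generated parser expects.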
A good habit is to place the following definition in the prologue of your (f)lex definition:
%option noyywrap noinput nounput nodefault
(Unless you need one of those features, and you should know which ones you need.) The noyywrap option removes the code which calls yywrap from the generated lexer, so that the lexer immediately returns an end-of-input indication when it encounters an EOF from the input stream. noinput and nounput remove the definitions of the input() and unput() functions, which would otherwise cause compiler warnings if your lexer actions do not use them. (By the way, you do compile with compiler warnings enabled, right? Not enabling compiler warnings is an excellent way to ignore the fact that you are shooting yourself in the foot.)
The nodefault option removes the (f)lex-generated default rule for unrecognised input characters, and warns you if it is possible that some input character is unrecognised. (This does not affect the default <<EOF>> action.) The default flex action on unrecognised input is ECHO, which means that unrecognised characters will simply be sent to standard output without generating any kind of error message. That is (almost) never what you want, and can also serve to mask real errors.
If you do use yywrap, the conventional return value to indicate end of input is 1 ("true"), not EOF, although EOF will work.
In your parser, you claim that the NAME token has a semantic value of type NAME. (It is extremely unwise to reuse the token name as a tagname; you should fix that.) However, your flex action which returns a NAME token does not fill in the semantic value. The most likely consequence is that the parser will receive a NULL when it is expecting a valid char*, which could certainly lead to a segfault.
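As a rough sketch of what "filling in the semantic value" means, the C below imitates the work a scanner action would do before returning NAME or NUMBER. The union is a stand-in for the one bison generates, the tag names str and number are assumptions, and set_name_value and set_number_value are hypothetical helpers; in a real flex file the same statements would sit directly in the rule actions, using yytext and yyleng.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the type bison would generate from
   %union { int number; char *str; }  (tag names are assumptions). */
typedef union {
    int   number;
    char *str;
} YYSTYPE;

/* Normally defined by the generated parser and declared in its header. */
static YYSTYPE yylval;

/* What the {IDENT} action would do before `return NAME;`.
   text and len play the roles of yytext and yyleng. The text must be
   copied because the scanner reuses its buffer for the next token. */
static void set_name_value(const char *text, size_t len)
{
    assert(text != NULL);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        abort();                 /* out of memory: give up in this sketch */
    memcpy(copy, text, len);
    copy[len] = '\0';
    yylval.str = copy;           /* the parser now owns this string */
}

/* What the {DIGIT}+ action would do before `return NUMBER;`. */
static void set_number_value(const char *text)
{
    assert(text != NULL);
    yylval.number = (int)strtol(text, NULL, 10);
}
```

The copy matters as much as the assignment: storing the scanner's own buffer pointer in the semantic value would leave the parser holding text that changes as soon as the next token is read.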
There are also some issues with your parser definition, which should be corrected. First is the point I raised above about the tagnames in your %union declaration. Second, you do not need to #include "y.tab.h" in the parser code because its contents have already been inserted.
The most confusing part of your parser code is your use of the named type syntax $<tag>1 throughout the grammar. Don't do that. You should declare all tokens and non-terminals with their correct types:
%token <str> NAME
%type <number> expr
(Assuming a more standard set of tagnames.) If you provide an explicit tagname, the generated parser will use that tagname, thereby bypassing type safety checks. (And, obviously, if you bypass a type safety check, it's much easier to use the incorrect union member in a rule, leading to who knows what consequences.)