Bison accept input after rule done


I want to parse text containing a single query. The query ends with a semicolon and looks like SQL, for example: CREATE TABLE 'somename'; My .y file is:

%{
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <stdbool.h>
#include "ast.h"

extern int yylex(void);
extern void yyerror(char const *msg);

QueryNode *queryNode;

%}

%union {
int integer;
char *str;
char chr;
bool boolean;
int intval;
char *strval;
ObjectTypeNode *objectTypeNode;
CreateClauseNode *createClauseNode;
QueryNode *queryNode;
}

%token  NUMBER
%token  INTNUM

%token<str> CREATE_KEYWORD
%token<str> DATABASE_KEYWORD
%token<str> TABLE_KEYWORD
%token<str> LETTER
%token<str> STRING
%token<str> IDENTIFIER
%token<chr> LEFT_BRACKET RIGHT_BRACKET COMMA SEMICOLON EOL

%type<objectTypeNode> object_type
%type<createClauseNode> create_clause
%type<queryNode> query

%start input

%%
input:      SEMICOLON EOL                               { queryNode = NULL; }
        |   query   SEMICOLON EOL                       { queryNode = $1; }
        ;

query:  create_clause                                   { $$ = CreateQueryNode($1, CREATE_CLAUSE_TYPE); }
        ;

create_clause:  CREATE_KEYWORD  object_type STRING      { $$ = CreateCreateClauseNode($2, $3); }
                ;

object_type:    DATABASE_KEYWORD                        { $$ = CreateObjectTypeNode(DATABASE_OBJECT); }
            |   TABLE_KEYWORD                           { $$ = CreateObjectTypeNode(TABLE_OBJECT); }
            ;
%%
void yyerror(char const *msg) {
    printf("Error: %s\n", msg);
}

And my .l file is:

%{
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <stdarg.h>
#include "ast.h"
#include "y.tab.h"
%}

%option noyywrap nodefault yylineno case-insensitive

%%
CREATE                  { yylval.strval = "create"; return CREATE_KEYWORD; }
DATABASE                { return DATABASE_KEYWORD; }
TABLE                   { return TABLE_KEYWORD; }
"("                     { return LEFT_BRACKET; }
")"                     { return RIGHT_BRACKET; }
";"                     { return SEMICOLON; }

-?[0-9]+                { yylval.intval = atoi(yytext); return INTNUM; }

L?'(\\.|[^\\'])+'   |
L?\"(\\.|[^\\"])*\"     { yylval.strval = yytext;   return STRING; }

[a-zA-Z]+[0-9]*         { return IDENTIFIER; }
[a-zA-Z]+               { return LETTER; }
[\n]                    { printf("eol\n"); return EOL; }
[ \t\f\v]               { ; }

.                       { return *yytext; }
%%

I call yyparse() from main(), which lives in a separate file. My main.c file is:

#include <stdio.h>
#include <stdlib.h>
#include "ast.h"
#include "y.tab.h"

extern QueryNode *queryNode;

int main(int argc, char *argv[]) {
    int result = yyparse();
    if(result == 0 && queryNode != NULL) {
        printf("AST created\n");
    } else {
        printf("Problem!\n");
    }
    return 0;
}

When I enter CREATE TABLE 'testo'; yyparse() does not return, and the program keeps waiting at the line int result = yyparse();. How can I fix this? I am using flex and bison, and I want the parser to terminate after this input.


Solution

  • In the original version of this question, the main rules in the grammar specification were:

    input: SEMICOLON                  { queryNode = NULL; YYACCEPT; }
         | query SEMICOLON            { queryNode = $1;   YYACCEPT; }
         ;

    As I said in the original version of this answer, those rules guarantee that a query followed by a semi-colon will be accepted by yacc as soon as the semicolon is encountered, because of the YYACCEPT action:

    yacc "accepts" because you used YYACCEPT in an action. YYACCEPT means "as soon as this production is recognised, accept the input even if it has not been fully consumed." So it is doing what you asked it to.

    I then suggested removing the YYACCEPT actions so that the parser wouldn't return until end-of-input is signaled by the lexer:

    If you only want to accept input if the entire input matches the grammar, just don't call YYACCEPT. Yacc will automatically accept if the start production matches and the next token is the end-of-input marker.

    But of course that doesn't magically cause reading to stop when a newline character is encountered. All it does is ensure that if the entire input is a single command, it will be accepted and otherwise it will be rejected. But since it is checking to make sure that nothing follows the command, it will continue to request input until it gets some.

    If you want the lexer to read only a single line, which must be a valid command, you can easily do that by removing YYACCEPT from the parser actions and having the scanner return an end-of-input indication when it sees a newline character:

    \n    { return 0; }
    

    (Returning zero is how the scanner signals end-of-input.)

    If what you really want is to build a program which reads multiple lines of input, parsing each line independently and returning after each one, then the above solution will work fine.
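To illustrate the end-of-input convention, here is a minimal sketch in plain C. It is not flex output: toy_yylex() is a hand-written stand-in for the generated scanner, toy_consume_line() stands in for one yyparse() call, and the token codes are hypothetical (a real build would take them from y.tab.h). The only point is that returning 0 at a newline makes each "parse" stop at the end of a line, and the next call picks up at the following line.

```c
#include <ctype.h>

/* Hypothetical token codes; a real build would take these from y.tab.h. */
enum { TOK_WORD = 258, TOK_STRING, TOK_SEMICOLON };

static const char *cursor = "";   /* next unread character of the input */

void toy_set_input(const char *text) { cursor = text; }

/* Hand-written stand-in for yylex(). Returns 0 (end-of-input) at a newline,
 * mirroring the rule  \n { return 0; }  in a flex scanner. */
int toy_yylex(void) {
    while (*cursor == ' ' || *cursor == '\t') cursor++;    /* skip blanks */
    if (*cursor == '\0') return 0;                         /* real end of input */
    if (*cursor == '\n') { cursor++; return 0; }           /* newline ends this parse */
    if (*cursor == ';')  { cursor++; return TOK_SEMICOLON; }
    if (*cursor == '\'') {                                 /* single-quoted string */
        cursor++;
        while (*cursor != '\0' && *cursor != '\n' && *cursor != '\'') cursor++;
        if (*cursor == '\'') cursor++;
        return TOK_STRING;
    }
    if (isalpha((unsigned char)*cursor)) {                 /* keyword or identifier */
        while (isalnum((unsigned char)*cursor)) cursor++;
        return TOK_WORD;
    }
    return (unsigned char)*cursor++;                       /* any other character */
}

/* Stand-in for a single yyparse() call: pull tokens until the scanner
 * reports end-of-input, and return how many tokens were seen. */
int toy_consume_line(void) {
    int count = 0;
    while (toy_yylex() != 0) count++;
    return count;
}
```

After calling toy_set_input() with two commands separated by a newline, each call to toy_consume_line() handles exactly one command, which is the behaviour you would get by calling yyparse() in a loop with the real scanner.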

    You could also play games in the parser, as with your new proposal, by having the scanner return a newline token when it sees a newline. Then you could accept or reject the input when the newline token is received, using YYACCEPT, YYABORT and an error production:

    input: SEMICOLON EOL              { queryNode = NULL; YYACCEPT; }
         | query SEMICOLON EOL        { queryNode = $1;   YYACCEPT; }
         | query error EOL            { YYABORT; }
         ;
    

    The error production is necessary in order to flush the rest of the line when a syntax error is encountered. Otherwise, the next call to the parser will start in the middle of the line which produced the error, at a slightly unpredictable point (because it will depend on whether the parser was holding a lookahead token when it signalled an error.)

    While this solution does have some advantages, it is somewhat more complex than the one which just returns 0 when a newline is read. So it is hard to justify the extra complexity.

    In any event, neither of these solutions is really ideal. At some point, you will almost certainly need to handle inputs which are too long to conveniently type in a single line.


    Now that you have included your complete scanner, I can see that you will have another serious problem, because you don't copy the token string before storing it in yylval (yylval.strval = yytext;). Retaining the address of the token (which is part of the scanner's internal input buffer) is not correct; that buffer will be changed without warning by the scanner (for example, when it needs more input). In particular, as soon as the scanner starts working on the next token, it will overwrite the NUL byte it had previously used to terminate the token, which will have the apparent effect that the token's string changes to two (or more) consecutive tokens. You can find a number of discussions about this problem on this site.
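The effect is easy to reproduce without flex. The following is a minimal sketch in plain C, assuming a much simplified scanner: scan_init(), next_token() and copy_token() are hypothetical helpers written for this illustration, and flex's real buffer management is more involved. Like yytext, the pointer returned by next_token() points into the scanner's own buffer and is only valid until the next call.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's internal input buffer. */
static char  scan_buf[64];
static char *scan_pos;     /* start of the unscanned input */
static char *saved_at;     /* where a NUL was temporarily written */
static char  saved;        /* the character that NUL replaced */

void scan_init(const char *input) {
    strncpy(scan_buf, input, sizeof scan_buf - 1);
    scan_buf[sizeof scan_buf - 1] = '\0';
    scan_pos = scan_buf;
    saved_at = NULL;
}

/* Return the next space-delimited token, NUL-terminated in place,
 * or NULL when the input is exhausted. */
char *next_token(void) {
    if (saved_at) {                 /* undo the previous token's terminator */
        *saved_at = saved;
        saved_at = NULL;
    }
    while (*scan_pos == ' ') scan_pos++;
    if (*scan_pos == '\0') return NULL;
    char *start = scan_pos;
    while (*scan_pos != '\0' && *scan_pos != ' ') scan_pos++;
    saved = *scan_pos;              /* remember the character we clobber */
    saved_at = scan_pos;
    *scan_pos = '\0';               /* terminate the token inside the buffer */
    return start;
}

/* Make a private copy of a token so it survives later scanning.
 * The caller owns the result and must free() it. */
char *copy_token(const char *tok) {
    char *copy = malloc(strlen(tok) + 1);
    if (copy) strcpy(copy, tok);
    return copy;
}
```

With the input CREATE TABLE 'testo', a pointer kept from the first next_token() call reads "CREATE" at first, but after the second call it reads "CREATE TABLE", because the terminating NUL has been restored to a space. The string returned by copy_token() still reads "CREATE". In a flex action the equivalent fix is to store a copy of yytext in yylval (and free it once the parser is done with it) rather than yytext itself.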