How can I improve the following grammar?


I am trying to find out where I went wrong in the code below.

Flex input:

%{
        #include "jq.tab.h"
        void yyerror(char *);
%}
method          add|map|.. and other methods go here

%%

"/*"            { return CS; }

"*/"            { return CE; }

"jQuery"        {
                printf("%s is yytext\n", yytext);
                return *yytext;
                }

"args"          { return ARGUMENT; }

{method}        { return METHOD; }

[().\n]         { return *yytext; }

[ \t]+          { return WS; }

.               { return IGNORE; }

%%

int yywrap(void) {
        return 1;
}

Bison input:

%{
        #include <stdio.h>
        int yylex(void);
        void yyerror(char *);
%}

%token ARGUMENT METHOD IGNORE WS CS CE
%error-verbose

%%

stmts:
        stmt '\n'               { printf("A single stmt\n"); }
        | stmt '\n' stmts       { printf("Multi stmts\n"); }
        ;

stmt:
        jQuerycall                      { printf("A complete call ends here\n"); }
        | ignorechars                   { printf("Ignoring\n"); }
        | ignorechars WS jQuerycall     { printf("ignore+js\n"); }
        | jQuerycall WS ignorechars     { printf("js+ignore\n"); }
        | optionalws stmt optionalws
        | CS stmt CE                    { printf("comment\n"); }
        ;

jQuerycall:
        'jQuery' '(' ARGUMENT ')' '.' methodchain       { printf("args n methodchain\n"); }
        | 'jQuery' '(' ')' '.' methodchain              { printf("methodchain\n"); }
        | 'jQuery' '(' ARGUMENT ')'                     { printf("args\n"); }
        | 'jQuery' '(' ')'                              { printf("empty call\n"); }
        ;

methodchain:
        methodchain '.' methodcall
        | methodcall
        ;

methodcall:
        METHOD '(' ')'
        ;

ignorechars:
        IGNORE
        | IGNORE optionalws ignorechars
        ;

optionalws:
        | WS
        ;

%%

void yyerror(char *s) {
        fprintf(stderr, "%s\n", s);
}

int main(void) {
        yyparse();
        return 0;
}

My aim is to recognize any jQuery call with all its elements and to ignore every other statement or string, as well as comments. Right now, this code makes many simplifying assumptions, for example that 'args' is the only selector that can appear inside jQuery().

Edit

I am using the following input-output cases. Cases like 10 and 12 are the ones I'm trying to figure out:

> 1.input: statement\n output: Ignoring
> 
> 2.input: statement statement\n output: Ignoring
> 
> 3.input: statement statement statement\n output: Ignoring
> 
> 4.input: jQuery()\n output: jQuery is yytext empty call A complete call ends here
> 
> 5.input: jQuery(args)\n output: jQuery is yytext args A complete call ends here
> 
> 6.input: jQuery().add()\n output: jQuery is yytext methodchain A complete call ends here
> 
> 7.input: jQuery(args).add().map()\n output: jQuery is yytext args n methodchain A complete call ends here
> 
> 8.input: /*comment*/\n output: Ignoring comment
> 
> 9.input: /*jQuery()*/\n output: jQuery is yytext empty call A complete call ends here comment
> 
> 10.input: /* comment */\n output: syntax error, unexpected CE, expecting IGNORE
> 
> 11.input: var a = b\n output: Ignoring
> 
> 12.input: var a = jQuery(args)\n output: jQuery is yytext syntax error, unexpected 'jQuery', expecting IGNORE

Solution

  • In your lex file, the rule:

    "jQuery"        {
                    printf("%s is yytext\n", yytext);
                    return *yytext;
                    }
    

    returns the token 'j' when it sees the input string jQuery, because *yytext is only the first character of the matched text. Since your bison grammar has no rule that uses the token 'j', this will generally give you a syntax error.

    You need to add JQUERY to your %token declaration, have this lex rule return JQUERY, and use JQUERY in place of 'jQuery' in the grammar rules.
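
    As a rough sketch of that change, the two fragments below show the lexer rule and the affected grammar rules reusing the names from the question. They only address the token mismatch described above; the rest of both files is assumed to stay as posted, and other parts of the grammar may still need work. In the flex file:

    ```lex
    "jQuery"        {
                    printf("%s is yytext\n", yytext);
                    return JQUERY;  /* named token from jq.tab.h, not the character 'j' */
                    }
    ```

    And in the bison file:

    ```yacc
    %token JQUERY ARGUMENT METHOD IGNORE WS CS CE

    jQuerycall:
            JQUERY '(' ARGUMENT ')' '.' methodchain       { printf("args n methodchain\n"); }
            | JQUERY '(' ')' '.' methodchain              { printf("methodchain\n"); }
            | JQUERY '(' ARGUMENT ')'                     { printf("args\n"); }
            | JQUERY '(' ')'                              { printf("empty call\n"); }
            ;
    ```

    Single-character tokens such as '(' and '.' can stay as character literals, since the lexer already returns them via *yytext.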

    edit

    Usually comments can appear anywhere in the program (between any two other tokens) and are completely ignored. So the easiest way to deal with them is in the lexer:

    %x comment
    %%
    "/*"            { BEGIN(comment); }
    <comment>"*/"   { BEGIN(INITIAL); }
    <comment>.|\n   ;   /* '.' does not match newline, so eat it explicitly */
    

    This will skip over comments (returning no tokens at all), so the grammar doesn't need to worry about them; the CS and CE tokens and the CS stmt CE rule can then be dropped. If you don't want to use a lexer start state, you could instead use a single, more complex regex:

    "/*"([^*]|\*+[^*/])*\*+"/"          ;