javascript parsing bison flex-lexer jison

Ask Jison to ignore some unnecessary details


I am writing a parser with Jison for a fairly complex expression language. The language includes grammar like:

stats_expression
  : stats_function '(' eval_expression ')'
  | other_stats_aggregation
  ;

stats_function
  : SUM
  | AVERAGE
  | ...
  ;

Here the eval_expression is very complex (with features like nested eval, logic expressions, etc.). I don't care about the contents of eval_expression and don't want to spend much effort parsing it. I would only like to obtain other information, such as the stats_function name in the above grammar.

My question is whether Jison offers some kind of wildcard matching that would let me match the entire eval_expression easily, without writing a full lexer/grammar specification for it.

NOTE: Using a regular expression instead of Jison does not work for me, because I also need to parse the other_stats_aggregation part of the language above, for which I am writing the entire grammar/parser.

Any help is appreciated.


Solution

  • Assuming that you don't need eval_expression to be fully parsed for any other purpose (i.e., it's not part of expression), the only thing you need to know is where the expression terminates. It's probably reasonable to assume that it has balanced parentheses, so it will span any sequence of tokens whose parentheses balance. That can be recognized with something like:

     balanced_paren_sequence: /* empty */
                            | balanced_paren_sequence balanced_paren_object
                            ;
    
     /* Since jison has no wild cards, you need this complete list */
     balanced_paren_object: '(' balanced_paren_sequence ')'
                          | '+' | '-' | '*' | '/' | ...
                          | '[' | ']' | '{' | '}' | ...
                          | IDENTIFIER | CONSTANT | ...
                          ;
    

    The list of possible right-hand sides for balanced_paren_object will include every token in your language except ( and ). As shown, it includes the other bracket pairs, [/] and {/}, as ordinary tokens.

    You could force these to balance as well by adding rules analogous to the first production for balanced_paren_object, but that is only useful to improve error reporting. As written, the parser will accept certain incorrect constructs involving unbalanced brackets. Since you are not doing detailed parsing, though, your parser is going to end up accepting certain incorrect constructs anyway.

    You could inline the definition of balanced_paren_object into balanced_paren_sequence (and indeed, you could use eval_expression as the name of that non-terminal if there is only one type of expression whose detailed parse tree you don't need); I wrote it as above in a vague attempt to be legible.
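
To make it concrete what the two productions above accept, here is a minimal hand-written JavaScript sketch of the same idea: skip tokens until the parenthesis that closes the stats_function call. It is only an illustration, not Jison-generated code and not part of Jison's API. The function names findClosingParen and parseStatsCall, and the assumption that the input is already an array of token strings, are hypothetical.

```javascript
// Hypothetical helper, not part of Jison: mirrors the two grammar rules by hand.
// `tokens` is an array of token strings; `start` is the index just after the
// '(' that follows the stats function name.
// Returns the index of the matching ')' or -1 if the parentheses never balance.
function findClosingParen(tokens, start) {
  let depth = 0;
  for (let i = start; i < tokens.length; i++) {
    const tok = tokens[i];
    if (tok === "(") {
      depth++; // corresponds to '(' balanced_paren_sequence ')'
    } else if (tok === ")") {
      if (depth === 0) return i; // this ')' closes the stats_function call
      depth--;
    }
    // Any other token plays the role of a balanced_paren_object and is skipped.
    // As in the grammar, '[' ']' '{' '}' are not checked for balance.
  }
  return -1;
}

// Extracts the function name and the raw (unparsed) tokens of the argument.
// Returns null if the input does not look like NAME '(' ... ')'.
function parseStatsCall(tokens) {
  const [name, open] = tokens;
  if (open !== "(") return null;
  const close = findClosingParen(tokens, 2);
  if (close === -1) return null;
  return { name, skipped: tokens.slice(2, close), next: close + 1 };
}
```

For example, calling parseStatsCall on the tokens of SUM ( a + ( b * c ) ) BY x yields the name SUM, the seven skipped tokens between the outer parentheses, and the index of BY as the place where regular parsing would resume. In a real Jison grammar the same information would come from the semantic action on the stats_expression rule.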