
Python package Lark does not build the grammar correctly


I need to build a tree like the following using the Lark package:

start
  expr
    or_expr
      and_expr
        comp_expr
          identifier    Name
          comparator    eq
          value 'Milk'
        comp_expr
          identifier    Price
          comparator    lt
          value 2.55

The grammar I use is the following:

from lark import Lark

# Raw string, so the regex backslashes (\d, \.) reach Lark without escape warnings.
odata_grammar = r"""
    start: expr

    expr: or_expr

    or_expr: and_expr ("or" and_expr)*
    and_expr: comp_expr ("and" comp_expr)*
    comp_expr: identifier comparator value -> comp_expr

    comparator: "eq" | "lt" | "gt" | "le" | "ge" | "ne"

    value: STRING | NUMBER
    identifier: CNAME

    STRING: /'(''|[^'])*'/
    DATE: /\d{4}-\d{2}-\d{2}/
    NUMBER: /-?\d+(\.\d+)?/

    %import common.CNAME
    %import common.WS
    %ignore WS
"""

parser = Lark(odata_grammar, start='start', parser='lalr')
url_filter = "Name eq 'Milk' and Price lt 2.55"
tree = parser.parse(url_filter)
print(tree.pretty())

When I print this tree, I get the following instead:

start
  expr
    or_expr
      and_expr
        comp_expr
          identifier    Name
          comparator
          value 'Milk'
        comp_expr
          identifier    Price
          comparator
          value 2.55

For some reason the comparator value is missing. Lark clearly recognizes it, since the input parses, but it is not printed in the tree. Curiously, when I try to force it with an alias in the grammar, such as comparator: "eq" -> eq, the node is renamed to eq instead of being shown as comparator with the value eq.


Solution

  • See the Tree Construction section in the Lark documentation: https://lark-parser.readthedocs.io/en/stable/tree_construction.html

    "Lark filters out certain types of terminals by default, considering them punctuation:

    Terminals that won’t appear in the tree are:

    • Unnamed literals (like "keyword" or "+")

    • Terminals whose name starts with an underscore (like _DIGIT)

    Terminals that will appear in the tree are:

    • Unnamed regular expressions (like /[0-9]/)

    • Named terminals whose name starts with a letter (like DIGIT)"

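    So the unnamed string literals in your comparator rule are filtered out, which leaves the comparator node empty. Option one: turn the string literals of your comparator rule into regular expressions: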

    odata_grammar = r"""
        start: expr
    
        expr: or_expr
    
        or_expr: and_expr ("or" and_expr)*
        and_expr: comp_expr ("and" comp_expr)*
        comp_expr: identifier comparator value -> comp_expr
    
        comparator: /eq/ | /lt/ | /gt/ | /le/ | /ge/ | /ne/
    
        value: STRING | NUMBER
        identifier: CNAME
    
        STRING: /'(''|[^'])*'/
        DATE: /\d{4}-\d{2}-\d{2}/
        NUMBER: /-?\d+(\.\d+)?/
    
        %import common.CNAME
        %import common.WS
        %ignore WS
    """

    Option two: add rules for each comparator literal:

    odata_grammar = r"""
        start: expr
    
        expr: or_expr
    
        or_expr: and_expr ("or" and_expr)*
        and_expr: comp_expr ("and" comp_expr)*
        comp_expr: identifier comparator value -> comp_expr
    
        comparator: eq | lt | gt | le | ge | ne
        eq: "eq"
        lt: "lt"
        gt: "gt"
        le: "le"
        ge: "ge"
        ne: "ne"
        value: STRING | NUMBER
        identifier: CNAME
    
        STRING: /'(''|[^'])*'/
        DATE: /\d{4}-\d{2}-\d{2}/
        NUMBER: /-?\d+(\.\d+)?/
    
        %import common.CNAME
        %import common.WS
        %ignore WS
    """
    

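    Both solutions will capture eq in the parse tree. With option one it appears as a token under comparator; with option two it appears as an empty eq subtree under comparator.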