Search code examples
context-free-grammarlark-parser

Match substrings in lark


How can properly match substrings using Lark?

My intention (maybe it's not possible/advisable with Lark, or any CFG) is to match and parse only important parts of the string, ignoring the rest. For example:

  • From "John", I'd like to parse "John" as one_person.
  • From "Yesterday John slept", I'd like to parse "John" as one_person and ignore the rest.
  • From "John and Mary", I'd like to parse "John and Mary" as two_people.
  • From "Yesterday John and Mary kissed.", I'd like to parse "John and Mary" as two_people and ignore the rest.

Here's my code:

from lark import Lark, tree

grammar = """
    input: not_important* important not_important*

    important: one_person
        | two_people

    one_person: PERSON
    two_people: one_person conj one_person
    not_important: RANDOM_WORD

    conj: CONJ

    PERSON: "John" | "Mary"
    CONJ: "and" | "or"
    RANDOM_WORD: /\\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input', ambiguity='explicit')
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

It works:

  • When nothing surrounds the important part, e.g. "John" or "John and Mary".
  • Or when only one side of the important part has unimportant stuff, e.g. "John slept" or "John and Mary kissed".

But it doesn't work when unimportant stuff surrounds the important stuff, e.g. "Yesterday John and Mary kissed". In this example, I was hoping to get:

input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed

But I get:

_ambig
  input
    not_important   Yesterday
    important
      one_person    John
    not_important   and
    not_important   Mary
    not_important   kissed
  input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed
    not_important   John
    not_important   and

That is, not only does Lark see the input as ambiguous, but it also fails the second parse, as two terminals ("John" and "and") are consumed twice.


Solution

  • One way to solve this would be to set a higher priority for some of your rules:

    from lark import Lark, tree
    
    grammar = """
        input: not_important* important not_important* 
    
        important.2: one_person
            | two_people
    
        one_person: PERSON
        two_people.2: one_person conj one_person
        not_important: RANDOM_WORD
    
        conj.2: CONJ
    
        PERSON: "John" | "Mary"
        CONJ: "and" | "or"
        RANDOM_WORD: /\\w+/
    
        %import common.WS
        %ignore WS
    """
    
    if __name__ == '__main__':
        parser = Lark(grammar, start='input')
        tree = parser.parse('Yesterday John and Mary kissed')
        print(tree.pretty())
    

    Notice the .2 added to important, two_people and conj, this is how you set priority in lark. By default, all rules have priority 1. With the priorities set, you can remove ambiguity='explicit', as lark will be able to deal with the ambiguities correctly.

    The second approach is to use lalr as the parser. This allows you set priorities on terminals. Then we can set PERSON and CONJ's priority to 2. This will make the grammar unambiguous, which is a requirement for using lalr

    
    from lark import Lark, tree
    
    grammar = """
        input: not_important* important not_important* 
    
        important: one_person
            | two_people
    
        one_person: PERSON
        two_people: one_person conj one_person
        not_important: RANDOM_WORD
    
        conj: CONJ
    
        PERSON.2: "John" | "Mary"
        CONJ.2: "and" | "or"
        RANDOM_WORD: /\\w+/
    
        %import common.WS
        %ignore WS
    """
    
    if __name__ == '__main__':
        parser = Lark(grammar, start='input', parser="lalr")
        tree = parser.parse('Yesterday John and Mary kissed')
        print(tree.pretty())
    

    Both approaches output

    input
      not_important Yesterday
      important
        two_people
          one_person        John
          conj      and
          one_person        Mary
      not_important kissed