How can properly match substrings using Lark?
My intention (maybe it's not possible/advisable with Lark, or any CFG) is to match and parse only important parts of the string, ignoring the rest. For example:
one_person
.one_person
and ignore the rest. two_people
.two_people
and ignore the rest.Here's my code:
from lark import Lark, tree
grammar = """
input: not_important* important not_important*
important: one_person
| two_people
one_person: PERSON
two_people: one_person conj one_person
not_important: RANDOM_WORD
conj: CONJ
PERSON: "John" | "Mary"
CONJ: "and" | "or"
RANDOM_WORD: /\\w+/
%import common.WS
%ignore WS
"""
if __name__ == '__main__':
parser = Lark(grammar, start='input', ambiguity='explicit')
tree = parser.parse('Yesterday John and Mary kissed')
print(tree.pretty())
It works:
But it doesn't work when unimportant stuff surrounds the important stuff, e.g. "Yesterday John and Mary kissed". In this example, I was hoping to get:
input
not_important Yesterday
important
two_people
one_person John
conj and
one_person Mary
not_important kissed
But I get:
_ambig
input
not_important Yesterday
important
one_person John
not_important and
not_important Mary
not_important kissed
input
not_important Yesterday
important
two_people
one_person John
conj and
one_person Mary
not_important kissed
not_important John
not_important and
That is, not only does Lark see the input as ambiguous, but it also fails the second parse, as two terminals ("John" and "and") are consumed twice.
One way to solve this would be to set a higher priority for some of your rules:
from lark import Lark, tree
grammar = """
input: not_important* important not_important*
important.2: one_person
| two_people
one_person: PERSON
two_people.2: one_person conj one_person
not_important: RANDOM_WORD
conj.2: CONJ
PERSON: "John" | "Mary"
CONJ: "and" | "or"
RANDOM_WORD: /\\w+/
%import common.WS
%ignore WS
"""
if __name__ == '__main__':
parser = Lark(grammar, start='input')
tree = parser.parse('Yesterday John and Mary kissed')
print(tree.pretty())
Notice the .2
added to important
, two_people
and conj
, this is how you set priority in lark. By default, all rules have priority 1.
With the priorities set, you can remove ambiguity='explicit'
, as lark will be able to deal with the ambiguities correctly.
The second approach is to use lalr
as the parser. This allows you set priorities on terminals. Then we can set PERSON
and CONJ
's priority to 2. This will make the grammar unambiguous, which is a requirement for using lalr
from lark import Lark, tree
grammar = """
input: not_important* important not_important*
important: one_person
| two_people
one_person: PERSON
two_people: one_person conj one_person
not_important: RANDOM_WORD
conj: CONJ
PERSON.2: "John" | "Mary"
CONJ.2: "and" | "or"
RANDOM_WORD: /\\w+/
%import common.WS
%ignore WS
"""
if __name__ == '__main__':
parser = Lark(grammar, start='input', parser="lalr")
tree = parser.parse('Yesterday John and Mary kissed')
print(tree.pretty())
Both approaches output
input
not_important Yesterday
important
two_people
one_person John
conj and
one_person Mary
not_important kissed