Search code examples
pythonparsinglalrlark-parser

Grammar rules for parsing optional keywords from list of words with LALR


I have strings with words like this:

  • "ABC Some other stuff" (normaly there is a short letter combination at the beginning)
  • "Some other stuff" (sometimes there is nothing interesting)
  • "HFG 54 Some other stuff and even more" (sometimes there is an interesting number)
  • "HFG 54 ZZ Some other stuff and even more" (sometimes there is ZZ after the number)
  • "HFG-54 ZZ Some other stuff and even more" (soemtimes there is a dash)
  • "ZT SOME OTHER STUFF" (The rest can be capitalized too)
  • "(ZT) Some other stuff" (The part can be in brackets)
  • "68 Some other stuff " (There can only be a number)
  • "Some other stuff DFG" (can be at the end)

I made some rules to parse it and with larks Early-parser it works fine. Now i want to try it out with the Lalr-Parser but then the special words are not recognized. I am interested in the capitalized letter combinations and the numbers. I have a list of the possible letter combinations. The numbers are always two digits. The rest of the string can be anything.

I am using lark. This is my parser:

from lark import Lark

tryoutabc = Lark(r"""
    start: mat_all_and_restl
    ?mat_all_and_restl: mat_all restl | restl | restl mat_br
    restl: REST*
    ?mat_all: mat_br| mat_num| num
    mat_num: mat_br ("-")? NUM
    ?num: NUM  ("ZZ")?
    NUM: /\d{2}/
    ?mat_br:  "("? MAT ")"?
    MAT:  "ABC" 
        | "HFG" 
        | "ZT" 
        | "DFG" 
    REST: /([\-])?[A-Za-zÜ]+([\/.][A-Za-zÜ]+)?/
    %import common.WS
    %ignore WS
""", start='start', parser='lalr')   #remove "parser='lalr'" and the results are right

How do i have to change the rules to make it possible with the lalr-parser to parse this?

To try it out:

textl = ["ABC SOME OTHER STUFF", "Some other stuff", "HFG 54 Some other stuff and even     more",
         "HFG 54 ZZ Some other stuff and even more", "HFG-54 Some other stuff and even more", "ZT Some other stuff",
         "(ZT) Some other stuff", "Some other stuff DFG", "Some other stuff (DFG)", "68 Some other stuff "]
for text in textl:
    print(tryoutabc.parse(text).pretty())

I get

start
  restl
    ABC
    SOME
    OTHER
    STUFF

but i want

start
  ABC
  restl
    SOME
    OTHER
    STUFF

For "HFG 54 Some other stuff and even more" i get an error:

HFG 54 Some other stuff and even more
    ^

Expecting: {'REST'}`

but i want:

start
  mat_num
    HFG
    54
  restl
    Some
    other
    stuff
    and
    even
    more

In reality the strings are longer and look like this "Interesting stuff i already parsed ABC Some other stuff" and i already parsed the stuff at the beginning of the strings and it works pefect.


From the comments it appears that it is not possible to do that because i dont have a context free language here and apparently larl can only do cfg-languages. If somebody adds and answer where this both is explained quick i would be happy.


Solution

  • The problem is the REST terminal. It can parse almost anything by design, meaning that

    
    start
        mat_all_and_restl
          ABC
          restl
            SOME
            OTHER
            STUFF
    

    and

    start
        restl
          ABC
          SOME
          OTHER
          STUFF
    

    Are both valid interpretations of the first example sentences (ABC SOME OTHER STUFF). This is something you can take a look at if you pass ambiguity='explicit' to an earley parser.

    This example works with lalr, since there isn't a problem if lalr just interprets everything as REST, but it does give you the wrong result.

    The problematic example "HFG 54 Some other stuff and even more" on the other hand doesn't work. It takes HFG as REST, and then tries to parse 54 which it can since it isn't REST and it doesn't expect a NUM right now.


    If this is the full extended of you grammar, use earley, or for even more speed than lalr could provided, a (or multiple) regex might also do the job quite well.

    (Note, I misspoke: this is potentially a CFG, just ambiguous. You could probably continue to hack on the grammar/the REST regex to make it work, but it won't be pretty.)