
Lark: how to pick only some patterns


I would like to extract only certain structured patterns from a text file.

For example, in the text below:

   blablabla 
   foo FUNC1 ; blabliblo blu

I would like to isolate only 'foo FUNC1 ;'.

I tried to use the Lark parser with the following grammar:

from lark import Lark

foo = Lark('''
  start: statement*
  statement: foo
           | anything
  anything: /.+/
  foo: "foo" ID ";"
  ID: /_?[a-z][_a-z0-9]*/i
  %import common.WS
  %import common.NEWLINE
  %ignore WS
  %ignore NEWLINE
''',
parser="lalr",
propagate_positions=True)

But the 'anything' rule captures everything. Is there a way to make it non-greedy, so that the 'foo' rule can capture the given pattern?


Solution

  • You could solve this with priorities.

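    For parser="lalr", Lark supports priorities on terminals. You could move "foo" into its own terminal and give that terminal a higher priority than the anonymous /.+/ terminal inside anything, which keeps the default priority (0 in Lark 1.x, 1 in older releases). The lexer tries higher-priority terminals first, so FOO wins wherever both could match: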

      foo : FOO ID ";"
      FOO.2: "foo"
    

    Parsing your example string then results in:

    start
      statement
        anything    blablabla 
      statement
        foo
          foo
          FUNC1
      statement
        anything    blabliblo blu
    

    For parser="earley", Lark supports priorities on rules, so you could use:

      foo.2 : "foo" ID ";"
    

    Parsing your example string then results in:

    start
      statement
        anything    blablabla 
      statement
        foo FUNC1
      statement
        anything     blabliblo blu