python, lark-parser

Lark matching custom delimiter multiline strings


I am trying to use Lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I came across is "here document" (heredoc) strings. I would describe them as multiline strings with custom delimiters, like:

$my_var .= << 'anydelim';
some things
other things
anydelim

While writing down this question, I figured out a solution using a regex with backreferences / named references. Since I could not find any similar question, I decided to post the question and answer it myself.

If anyone knows any other method (like a way to use back references across multiple lark rules), please let me know!


Solution

  • A solution using a regexp. Key ingredients:

    • back references, in this case named references
    • the /s modifier (causes . to also match newlines)
    • .*? to match non-greedily (otherwise . would also consume the closing delimiter and the match would run on to its last occurrence)
    from lark import Lark
    
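    # The whole heredoc is one regex terminal: the quote character and the
    # delimiter are captured once and reused through named backreferences.
    block_grammar = r"""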
        %import common.WS
        %ignore WS
        delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
    """
    minimal_parser = Lark(block_grammar, start="delimited_string")
    
    ast = minimal_parser.parse(r"""
        << 'SomeDelim'; fasdfasdf 
        fddfsdg SomeDelim
    """)
    print(ast.pretty())
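
The three ingredients can also be checked in isolation with Python's re module, without Lark. This is only a minimal sketch: the names OPENING, LAZY, GREEDY and the extra body group are made up for illustration, and the Perl snippet is a simplified test input. It compares the non-greedy pattern with a greedy variant on a source that reuses the same delimiter twice.

```python
import re

# Opening part shared by both patterns:
# "<<", a quote, the delimiter, the same quote again, then ";".
OPENING = r"""<<\s*(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote);"""

# re.DOTALL plays the role of the /s flag: "." also matches newlines.
LAZY = re.compile(OPENING + r"(?P<body>.*?)(?P=delimiter)", re.DOTALL)
GREEDY = re.compile(OPENING + r"(?P<body>.*)(?P=delimiter)", re.DOTALL)

# Hypothetical input: two heredocs that use the same delimiter.
perl_source = """\
$a .= << 'END';
first
END
$b .= << 'END';
second
END
"""

lazy_match = LAZY.search(perl_source)
greedy_match = GREEDY.search(perl_source)

lazy_body = lazy_match.group("body")      # stops at the first closing END
greedy_body = greedy_match.group("body")  # runs on to the last END

print(repr(lazy_match.group("delimiter")))
print(repr(lazy_body))
print(repr(greedy_body))
```

Note that (?P=delimiter) also matches when the delimiter text appears inside a longer word, whereas Perl expects the terminator on a line of its own. With plain re, anchoring it as ^(?P=delimiter)$ together with re.MULTILINE would be one way to tighten the match.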