I am trying to use Lark to extract some information from Perl files. For that, I need a basic understanding of what a statement is. The issue I came across is "here document" (heredoc) strings. I would describe them as multiline strings with custom delimiters, like:
$my_var .= << 'anydelim';
some things
other things
anydelim
While writing down this question, I figured out a solution using a regex with named groups and backreferences. Since I could not find any similar question, I decided to post the question and answer it myself.
If anyone knows another method (like a way to use backreferences across multiple Lark rules), please let me know!
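A solution using a regexp. Key ingredients: the named groups (?P<quote>...) and (?P<delimiter>...) capture the opening quote character and the delimiter, the backreferences (?P=quote) and (?P=delimiter) require the same text to appear again, and the s flag lets . match across newlines.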
from lark import Lark
block_grammar = r"""
%import common.WS
%ignore WS
delimited_string: "<<" /(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)\;.*?(?P=delimiter)/s
"""
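minimal_parser = Lark(block_grammar, start="delimited_string")
# The regex terminal consumes everything from the quoted delimiter
# up to the next occurrence of that same delimiter.
tree = minimal_parser.parse(r"""
<< 'SomeDelim'; fasdfasdf
fddfsdg SomeDelim
""")
print(tree.pretty())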
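To check the pattern outside of Lark, the same named-group and backreference idea can be tried with Python's re module alone. The sketch below is not part of the grammar above: the function name find_heredocs and the groups rest and body are made up for illustration, and it adds one assumption the Lark terminal does not make, namely that the terminator has to stand alone on its own line (the Lark regex stops at the first occurrence of the delimiter anywhere in the text).

```python
import re

# Named groups capture the quote character and the delimiter;
# (?P=quote) and (?P=delimiter) require the same text to appear again.
# Assumption for this sketch: the terminator is a line of its own,
# so it is anchored with ^ and $ under re.MULTILINE.
HEREDOC = re.compile(
    r"""<<\s*(?P<quote>['"])(?P<delimiter>[A-Za-z_]+)(?P=quote)
        (?P<rest>[^\n]*)\n     # remainder of the introducing line, e.g. ';'
        (?P<body>.*?)          # heredoc body, as short as possible
        ^(?P=delimiter)$       # terminator line
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)


def find_heredocs(source):
    """Return a list of (delimiter, body) pairs found in source."""
    return [
        (m.group("delimiter"), m.group("body"))
        for m in HEREDOC.finditer(source)
    ]


if __name__ == "__main__":
    perl = "$my_var .= << 'anydelim';\nsome things\nother things\nanydelim\n"
    for delimiter, body in find_heredocs(perl):
        print(delimiter)
        print(body, end="")
```

Anchoring the terminator means a body line that merely contains the delimiter text does not end the string early. Unquoted delimiters and the indented <<~ form are deliberately left out of this sketch.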