How to parse ~{expr} inside a string with Lark EBNF


I am trying to write a Lark grammar for a DSL, but I am having trouble with this string interpolation syntax:

" abc " <- normal string
" xyz~{expression}abc " <- string with interpolation

So a ~{ switches from string to expression, and a } terminates that expression. I think this is close:

string : "\"" (string_interp|not_string_interp)* "\""
string_interp: "~{" expression "}"
not_string_interp: /([^~][^{])+/

But this regex consumes characters two at a time, so it only matches runs of even length, and if the ~ and the { fall into different pairs (the ~{ straddles a pair boundary), the interpolation is missed.
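The failure is easy to reproduce with Python's re module, independent of Lark. This is a minimal sketch using two of the test cases from this question; the variable names are illustrative only.

```python
import re

# The first attempt: consume characters in pairs, rejecting only the exact pair "~{".
pair_regex = re.compile(r'([^~][^{])+')

# Four characters before "~{": the pairs line up and the match stops in the right place.
even = pair_regex.match('a bc~{1}c d')
print(repr(even.group(0)))  # 'a bc'

# Three characters before "~{": the pairs are "a ", "b~", "{1", ... so the
# "~{" is split across two pairs and swallowed as ordinary text.
odd = pair_regex.match('a b~{1}c d')
print(repr(odd.group(0)))   # 'a b~{1}c d'
```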

not_string_interp: /(.?|([^~][^{])+)/

This is about as far as I could get, but it still seems wrong. Can I use lookaheads? I would also like to keep %ignore WS on, since it cuts the noise down massively, so a solution that accounts for that would be great!

Thanks

Test cases:

""
"a"
"~{1}"
" ~{1} "
"a bc~{1}c d"
"a b~{1}c d"

Solution

  • I think this does it. Sadly, any ~ not followed by { will split the string into separate pieces, but I can reconstruct them later. I kept getting fooled by the equal precedence of rules and the greediness of regexes.

    /[^"~]+/ matches anything that is not ~ or " (regular string text)

    "~{" expression "}" matches the normal interpolated expression

    /~(?!{)/ handles a ~ that is not followed by {. The negative lookahead (?!...) is needed because the next character must not be consumed (it could be " or another ~)
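Outside Lark, the same three alternatives can be tried out with Python's re module to see what the lookahead does. This is only a sketch: the group names and the split_body helper are made up for illustration, and it works on the text between the quotes rather than on a full grammar.

```python
import re

# The three alternatives from above, combined into one stdlib regex.
PIECE = re.compile(r'(?P<text>[^"~]+)|~\{(?P<expr>[^}]+)\}|(?P<tilde>~(?!\{))')

def split_body(body):
    """Split the inside of a string literal into (kind, value) pairs."""
    pieces = []
    pos = 0
    while pos < len(body):
        match = PIECE.match(body, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {body[pos:]!r}")
        kind = match.lastgroup          # name of the alternative that matched
        pieces.append((kind, match.group(kind)))
        pos = match.end()
    return pieces

print(split_body('a~b{}c}d~{1}g'))
# [('text', 'a'), ('tilde', '~'), ('text', 'b{}c}d'), ('expr', '1'), ('text', 'g')]
```

Because the lookahead does not consume the character after the ~, a following ~{ is still available to start an interpolation, as in '~~{1}'.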

    from lark import Lark

    # `expression` is only a stand-in here; swap in the DSL's real expression rule.
    grammar = r"""
        string: "\"" string_thing* "\""
        string_thing: /[^"~]+/
            | "~{" expression "}"
            | /~(?!{)/
        expression: /[^}]+/
    """

    parser = Lark(grammar, start='string', ambiguity="explicit")

    # Other inputs to try: '"a"', '"~abc~"', '"~{1}~~{1}~~~{1}"'
    print(parser.parse('"a~b{}c}d~{1}g"').pretty())
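The "reconstruct them later" step could look roughly like this. It is a sketch under an assumption: the parse result has already been flattened into a list of (kind, value) pairs, which is a hypothetical representation and not Lark's own tree type, and merge_text is a made-up helper name.

```python
from itertools import groupby

def merge_text(pieces):
    """Join runs of adjacent literal fragments; leave expressions untouched."""
    merged = []
    for is_expr, run in groupby(pieces, key=lambda piece: piece[0] == "expr"):
        run = list(run)
        if is_expr:
            merged.extend(run)  # keep each interpolation as its own entry
        else:
            # plain text and lone "~" fragments collapse back into one string
            merged.append(("text", "".join(value for _, value in run)))
    return merged

pieces = [("text", "a"), ("tilde", "~"), ("text", "b{}c}d"), ("expr", "1"), ("text", "g")]
print(merge_text(pieces))
# [('text', 'a~b{}c}d'), ('expr', '1'), ('text', 'g')]
```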