Tags: python, parsing, peg, parsimonious

How to write a PEG parser that fully consumes any and all text whilst still matching other given rules?


I'm making an application to make writing PEG parsers more approachable and user-friendly for people without experience. Yes, it has been done before, but it's a good learning experience for me regarding GUIs.

Part of what would make it approachable is that users don't need to worry about their grammar having to match the whole text. They should be able to extract meaningful data without all that "boilerplate".

How would one do this? Please see my answer below, or provide your own.


Solution

  • This stumped me for most of an evening and I couldn't find it answered anywhere online, so I figured I'd share.

    Here is an MRE of what I have, using the parsimonious library. It works because "match" matches any top-level user-defined expression, and there is a fallback rule that matches anything else, sadly only one character at a time.

    from parsimonious.grammar import Grammar
    
    grammar = Grammar("""
    root = (match / any)*
    match = foo / bar # must include all top level user defined rules, but not their children (if any)
    any = ~"."s # fallback, one character at a time; the s flag makes the dot match newlines too
    foo = "foo expression" # user defined
    bar = "bar expression" # user defined
    """)
    
    print(grammar.match("1 foo expression 2 bar expression 3"))
    

    The printed output is correct:

    <Node called "root" matching "1 foo expression 2 bar expression 3">
        <Node matching "1">
            <RegexNode called "any" matching "1">
        <Node matching " ">
            <RegexNode called "any" matching " ">
        <Node matching "foo expression">
            <Node called "match" matching "foo expression">
                <Node called "foo" matching "foo expression">
        <Node matching " ">
            <RegexNode called "any" matching " ">
        <Node matching "2">
            <RegexNode called "any" matching "2">
        <Node matching " ">
            <RegexNode called "any" matching " ">
        <Node matching "bar expression">
            <Node called "match" matching "bar expression">
                <Node called "bar" matching "bar expression">
        <Node matching " ">
            <RegexNode called "any" matching " ">
        <Node matching "3">
            <RegexNode called "any" matching "3">
    

    To be honest, I don't find it very elegant, especially how every unmatched character gets its own "any" node under "root" (I'd much prefer them grouped together or omitted completely). It's the best I could do, though, and if it's of use to anyone, that's all that matters!


    The Parsimonious README has examples like this:

    my_grammar = Grammar(r"""
        styled_text = bold_text / italic_text
        bold_text   = "((" text "))"
        italic_text = "''" text "''"
        text        = ~"[A-Z 0-9]*"i
        """)
    

    To me, this suggests there's a way to apply such a grammar to a larger body of text (one that also contains text that's neither bold nor italic) that I'm not aware of. The only alternative I can think of is calling parse/match with the optional "pos" (position) parameter at every position of the document, which isn't elegant either.

    I can't work out how from the README. If anyone knows the "proper" way, please share.