
PyParsing - Grammar Elements Split Around Other Elements


I'm moving a tool (not written by me) to use PyParsing. I'm updating the grammar to make more sense, but I would also like to stay backwards-compatible. The syntax includes elements which are split by another element, and I need both the "verb" (the wrapping element) and the "value" (the wrapped element). (No, these aren't globs, although they look like them. It's confusing, and part of the reason why I'm changing it.)

*thing*  # verb: contains      value: thing
*thing   # verb: starts_with   value: thing
thing*   # verb: ends_with     value: thing
!*thing* # verb: not_contains  value: thing

I'm having a hard time wrapping my head around how to parse something like *thing* where the "verb" is wrapped around the "value". These elements are also in a delimited list, although that part I'm fine with.

A complete example of what I would be parsing:

command *thing*, !*other_thing*, *third_thing

What I've tried:

import pyparsing as pp

command = pp.Keyword("command").setResultsName("command")
value = pp.Word(pp.alphanums + "_").setResultsName("value", listAllMatches=True)

contains = ("*" + value + "*").setResultsName("verb", listAllMatches=True)
not_contains = ("!*" + value + "*").setResultsName("verb", listAllMatches=True)
starts_with = ("*" + value).setResultsName("verb", listAllMatches=True)

verbs_and_values = (
    contains
    | not_contains
    | starts_with
)

directive = pp.Group(command + pp.delimitedList(verbs_and_values, delim=","))

example = "command *thing*, !*other_thing*, *third_thing"

result = directive.parseString(example)
print(result.dump())

This gets me all the values, but the verbs are the whole thing (i.e. ['*', 'thing', '*']). I tried adjusting the verbs with a parseAction similar to this:

def process_verb(tokens):
    if tokens[0] == '*' and tokens[-1] == '*':
        return "contains"
    # handle other verbs...

Which works fine, but it blows away the values...


Solution

  • I see that you are using results names with listAllMatches=True to capture multiple parsed values in a delimitedList. This is okay for simple data structures, but once you want to store several pieces of information for each item (here, a verb and a value), you will need to start using Group or parse action classes.

    As a general practice, I avoid using results names on low-level expressions, and instead add them when composing higher-level expressions with '+' and '|' operators. I also mostly use the expr("name") form rather than the expr.setResultsName("name") form for setting results names.
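
    As a small illustration of the two forms, the sketch below attaches the same results name both ways. The names word, long_form, and short_form are made up for this example; the point is only that both forms return a named copy, so the low-level expression can stay unnamed until it is composed into something larger.

```python
import pyparsing as pp

# Low-level expression, deliberately left without a results name.
word = pp.Word(pp.alphanums + "_")

# Two equivalent ways to attach a results name. Each returns a copy,
# so `word` itself stays unnamed and reusable.
long_form = word.setResultsName("value")
short_form = word("value")

# Name the expression at the point where it is composed with '+'.
starts_with = "*" + word("value")

long_value = long_form.parseString("thing").value
short_value = short_form.parseString("thing").value
composed_value = starts_with.parseString("*thing").value
print(long_value, short_value, composed_value)
```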

    Here is a modified version of your code using Groups:

    command = pp.Keyword("command")
    value = pp.Word(pp.alphanums + "_")
    
    contains = pp.Group("*" + value("value") + "*")
    not_contains = pp.Group("!*" + value("value") + "*")
    starts_with = pp.Group("*" + value("value"))
    

    I also added results names for command and for the list of verbs in directive (verbs_and_values is defined as in your original code):

    directive = pp.Group(command("command")
                         + pp.Group(pp.delimitedList(verbs_and_values, 
                                            delim=","))("verbs"))
    

    Now that these expressions are wrapped in Groups, it is not necessary to use listAllMatches=True, since each value is now kept in its own separate group.

    The parsed results now look like this:

    [['command', [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]]]
    [0]:
      ['command', [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]]
      - command: 'command'
      - verbs: [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]
        [0]:
          ['*', 'thing', '*']
          - value: 'thing'
        [1]:
          ['!*', 'other_thing', '*']
          - value: 'other_thing'
        [2]:
          ['*', 'third_thing']
          - value: 'third_thing'
      
    

    You were on the right track of using a parse action to add information about the type of verb, but instead of returning that value, you want to add the type of the verb as another named result.

    def add_type_parse_action(verb_type):
        def pa(s, l, t):
            t[0]["type"] = verb_type
        return pa
    
    contains.addParseAction(add_type_parse_action("contains"))
    not_contains.addParseAction(add_type_parse_action("not_contains"))
    starts_with.addParseAction(add_type_parse_action("starts_with"))
    

    After adding the parse actions, you get these results:

    [['command', [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]]]
    [0]:
      ['command', [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]]
      - command: 'command'
      - verbs: [['*', 'thing', '*'], ['!*', 'other_thing', '*'], ['*', 'third_thing']]
        [0]:
          ['*', 'thing', '*']
          - type: 'contains'
          - value: 'thing'
        [1]:
          ['!*', 'other_thing', '*']
          - type: 'not_contains'
          - value: 'other_thing'
        [2]:
          ['*', 'third_thing']
          - type: 'starts_with'
          - value: 'third_thing'
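
    For reading these results back out in code, something like the following sketch should work. It repeats the grammar from this answer so that it runs on its own; the variable name pairs is made up for the example, and the items are read with dict-style keys on each group.

```python
import pyparsing as pp

command = pp.Keyword("command")
value = pp.Word(pp.alphanums + "_")

contains = pp.Group("*" + value("value") + "*")
not_contains = pp.Group("!*" + value("value") + "*")
starts_with = pp.Group("*" + value("value"))

def add_type_parse_action(verb_type):
    # Build a parse action that tags the matched group with its verb type.
    def pa(s, l, t):
        t[0]["type"] = verb_type
    return pa

contains.addParseAction(add_type_parse_action("contains"))
not_contains.addParseAction(add_type_parse_action("not_contains"))
starts_with.addParseAction(add_type_parse_action("starts_with"))

verbs_and_values = contains | not_contains | starts_with

directive = pp.Group(command("command")
                     + pp.Group(pp.delimitedList(verbs_and_values,
                                                 delim=","))("verbs"))

result = directive.parseString("command *thing*, !*other_thing*, *third_thing")

# result[0] is the directive group; each entry in "verbs" is its own group
# carrying both the "type" and "value" results names.
pairs = [(verb["type"], verb["value"]) for verb in result[0]["verbs"]]
for verb_type, verb_value in pairs:
    print(verb_type, verb_value)
```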
          
    

    You can also define classes to give structure to your results. pyparsing calls the class just as it would call a parse action function, so Python constructs a class instance from the parsed tokens:

    class VerbBase:
        def __init__(self, tokens):
            self.tokens = tokens[0]
    
        @property
        def value(self):
            return self.tokens.value
        
        def __repr__(self):
            return "{}(value={!r})".format(type(self).__name__, self.value)
    
    class Contains(VerbBase): pass
    class NotContains(VerbBase): pass
    class StartsWith(VerbBase): pass
    
    contains.addParseAction(Contains)
    not_contains.addParseAction(NotContains)
    starts_with.addParseAction(StartsWith)
    
    result = directive.parseString(example)
    print(result.dump())
    

    Now the results are in object instances, whose types indicate what kind of verb was used:

    [['command', [Contains(value='thing'), NotContains(value='other_thing'), StartsWith(value='third_thing')]]]
    [0]:
      ['command', [Contains(value='thing'), NotContains(value='other_thing'), StartsWith(value='third_thing')]]
      - command: 'command'
      - verbs: [Contains(value='thing'), NotContains(value='other_thing'), StartsWith(value='third_thing')]
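
    The question also lists a thing* form (ends_with), which the code in this answer does not cover. A minimal sketch of how the same class-based pattern might extend to it is below. The EndsWith class, the ends_with expression, and the fourth_thing* input are assumptions added for illustration, not part of the original grammar.

```python
import pyparsing as pp

class VerbBase:
    def __init__(self, tokens):
        # tokens[0] is the Group matched for this verb
        self.tokens = tokens[0]

    @property
    def value(self):
        return self.tokens.value

    def __repr__(self):
        return "{}(value={!r})".format(type(self).__name__, self.value)

class Contains(VerbBase): pass
class NotContains(VerbBase): pass
class StartsWith(VerbBase): pass
class EndsWith(VerbBase): pass  # hypothetical, for the thing* form

command = pp.Keyword("command")
value = pp.Word(pp.alphanums + "_")

contains = pp.Group("*" + value("value") + "*").addParseAction(Contains)
not_contains = pp.Group("!*" + value("value") + "*").addParseAction(NotContains)
starts_with = pp.Group("*" + value("value")).addParseAction(StartsWith)
# Assumed expression for thing*: a value followed by a trailing "*".
ends_with = pp.Group(value("value") + "*").addParseAction(EndsWith)

# contains must be tried before starts_with, since both begin with "*".
verbs_and_values = contains | not_contains | starts_with | ends_with

directive = pp.Group(command("command")
                     + pp.Group(pp.delimitedList(verbs_and_values,
                                                 delim=","))("verbs"))

example = "command *thing*, !*other_thing*, *third_thing, fourth_thing*"
result = directive.parseString(example)

# The class of each instance identifies the verb; .value holds the wrapped text.
summary = [(type(verb).__name__, verb.value) for verb in result[0]["verbs"]]
print(summary)
```

    Downstream code can then branch on isinstance(verb, Contains) and so on, or each class can carry its own behaviour as a method.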
    

    Note: Throughout your question, you refer to the items after the command as "verbs", and I have retained that name in this answer to make it easier to compare with your initial attempts. But a "verb" usually denotes an action, which here is closer to the "command" in your directive; the items that follow are more like "qualifiers", "arguments", or "subjects". Names are important when coding, not just when communicating with others, but also when forming your own mental model of what your code is doing.