Parse variable number of optional parameters with pyparsing

I'm using pyparsing to build a DSL for a message protocol used internally by our company. One problem I haven't been able to solve is how to generate a consistent result for input strings that have optional parameters. The following is the parse rule I have constructed:

from pyparsing import *

label       = SkipTo(';' ^ LineEnd())
delim       = Char(';').suppress()
value       = (Word(nums) ^ Combine('0x' + Word(hexnums)))
descr       = QuotedString(quoteChar="'''", multiline=True) ^ SkipTo(LineEnd())

field       = Keyword('field') + label + delim + Keyword('u8') \
                + Optional(delim + Optional(value, default=None) + Optional(delim + descr, default = None), default = None)

print(field.parseString('field Field #1; u8'))
print(field.parseString('field Field #2; u8; 1'))
print(field.parseString('field Field #3; u8; 1; This is a description of the field'))
print(field.parseString('field Field #3; u8; ; This is a description of the field'))

The output of that bit of code is:

['field', ' Field #1', 'u8', None]
['field', ' Field #2', 'u8', '1', None]
['field', ' Field #3', 'u8', '1', 'This is a description of the field']
['field', ' Field #3', 'u8', None, 'This is a description of the field']

And my preferred output is:

['field', 'Field #1', 'u8', None, None]
['field', 'Field #2', 'u8', '1', None]
['field', 'Field #3', 'u8', '1', 'This is a description of the field']
['field', 'Field #3', 'u8', None, 'This is a description of the field']

The other irritant for me is that the field name starts with a space, which I'd like to get rid of.

How should I construct the parse rule so that the actual output matches the preferred one?

Solution

It looks like you are off to a good start. I always recommend that people write out their parser in a non-code format, such as a Backus-Naur Form. It doesn't have to be super-rigorous, just some notation to help you think about the format you plan to parse before you actually start thinking coding thoughts.

Based on your example, I came up with this (where '|' means alternation, and '[]' means optional):

"""
BNF:
field ::= 'field' label ';' 'u8' [';' [value] [';' [description] ] ] 
label ::= all non-';' characters
value ::= integer | '0x' hex_integer
description ::= multiline string | rest of line
"""

I used your current field definition with some small adjustments. One was the addition of Empty() before the label, so that the leading spaces are not included in the parsed value for the label. Empty() matches nothing itself, but pyparsing skips whitespace before matching it, so the label starts at the first non-blank character. (This is kind of a hack; the White().suppress() in jdaz's answer is more explicit.)
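To see the Empty() trick in isolation, here is a minimal sketch that reuses the label expression from the question. The names with_empty and trimmed are only for illustration, and it assumes the camelCase API (parseString, asList) used elsewhere in this post:

```python
from pyparsing import Empty, Keyword, LineEnd, SkipTo

# Same label expression as in the question: everything up to ';' or end of line.
label = SkipTo(';' ^ LineEnd())

# Empty() matches zero characters, but pyparsing skips leading whitespace
# before matching it, so label starts at the first non-blank character.
with_empty = Keyword('field') + Empty() + label

trimmed = with_empty.parseString('field Field #1; u8').asList()
print(trimmed)
```

The label should come back without the leading space that appears in the question's output.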

Then I added results names for each of the parts of your expression. I strongly recommend using results names: they make your post-parsing work much easier, because you can find and work with the individual parsed elements by name instead of by positional index. Results names also make your parser more maintainable, since adding new elements to the parser later may shift the positions of the parsed results.

field       = (Keyword('field') 
               + Empty() # skips whitespace
               + label("label")
               + delim
               + Keyword('u8')("type")
               + Optional(delim 
                            + Optional(value("value")) 
                            + Optional(delim 
                                       + descr("descr"))
                          )
               )

Since there are multiple optional fields, and optionals within optionals, this goes beyond what the default value of Optional can do. Instead, the missing fields can be added using a parse action (a callback function that is called during parsing after a particular expression has been successfully parsed). The results names that we defined also make it easier to determine which fields have been provided and which have not, something that would be more difficult using plain unnamed results.

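def fill_in_defaults(t):
    # t is the ParseResults for one matched field; it is modified in place
    if "descr" not in t:
        t["descr"] = None   # add the named result
        t.append(None)      # and pad the list of tokens
    if "value" not in t:
        t["value"] = None
        # descr (or its None placeholder) is the last token by now,
        # so the value placeholder goes just before it
        t.insert(-1, None)
    else:
        # convert the named value to int (base 16 if it has a 0x prefix)
        if t.value.startswith("0x"):
            t["value"] = int(t.value[2:], 16)
        else:
            try:
                t["value"] = int(t.value)
            except ValueError:
                # fallback only; the value expression above matches digits only
                t["value"] = float(t.value)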

field.addParseAction(fill_in_defaults)
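Putting the pieces together, the following self-contained sketch checks the padding against the preferred output from the question. It is a simplified variant rather than the exact code above: pad_missing_fields and parse_field are hypothetical names, and the int conversion is left out so the values stay strings, as in the preferred output.

```python
from pyparsing import (Char, Combine, Empty, Keyword, LineEnd, Optional,
                       QuotedString, SkipTo, Word, hexnums, nums)

label = SkipTo(';' ^ LineEnd())
delim = Char(';').suppress()
value = Word(nums) ^ Combine('0x' + Word(hexnums))
descr = QuotedString("'''", multiline=True) ^ SkipTo(LineEnd())

field = (Keyword('field')
         + Empty()  # skips whitespace before the label
         + label("label")
         + delim
         + Keyword('u8')("type")
         + Optional(delim
                    + Optional(value("value"))
                    + Optional(delim + descr("descr"))))

def pad_missing_fields(t):
    # Hypothetical helper: pad both the named results and the token list.
    if "descr" not in t:
        t["descr"] = None
        t.append(None)
    if "value" not in t:
        t["value"] = None
        t.insert(-1, None)  # keep value just before descr

field.addParseAction(pad_missing_fields)

def parse_field(line):
    # Hypothetical wrapper returning the tokens as a plain list.
    return field.parseString(line).asList()

for line in ['field Field #1; u8',
             'field Field #2; u8; 1',
             'field Field #3; u8; 1; This is a description of the field',
             'field Field #4; u8; ; This is a description of the field']:
    print(parse_field(line))
```

Every line should yield a five-element list, with None standing in for whichever optional parts are missing.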

Named results can be accessed using dict-style [key] notation or object-style .key attribute notation. But to assign new results names manually, you must use the dict-style form.
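Here is a small sketch of both access styles, using a made-up two-token expression rather than the field parser (setting, add_unit and the "unit" name are only for illustration):

```python
from pyparsing import Word, alphas, nums

setting = Word(alphas)("key") + Word(nums)("value")

def add_unit(tokens):
    # Dict-style assignment adds a named result; it does not touch the token list.
    tokens["unit"] = "ms"

setting.addParseAction(add_unit)

result = setting.parseString("timeout 250")

print(result["key"])     # dict-style read
print(result.value)      # attribute-style read
print("unit" in result)  # membership test, as used in fill_in_defaults
print(result.asList())   # positional tokens only
print(result.asDict())   # named results only
```

Note that assigning by name does not add anything to the list of tokens, which is why fill_in_defaults both sets the name and calls append or insert.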

Lastly, I re-did your test strings using runTests, which makes it a lot easier to create multiple test cases for a parse expression:

field.runTests("""\
    field Field #1; u8
    field Field #2; u8; 1
    field Field #3; u8; 1; This is a description of the field
    field Field #4; u8; ; This is a description of the field
    field Field #5; u8; 0x1b
    """)

This gives output for each test that:

- echoes the input string
- dumps the parsed results as printed from results.dump():
  - lists the parsed values as a list
  - lists a hierarchy of named results
- or, if there is a parse error, shows a '^' at the location of the parse failure and a message for the error
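If you want to check the tests programmatically rather than read the printed report, runTests also returns a (success, results) pair. A minimal sketch, using a deliberately tiny number expression instead of the full field parser (number and the sample inputs are just for illustration):

```python
from pyparsing import ParseException, Word, nums

number = Word(nums)

# printResults=False suppresses the printed report; the return value is
# (overall_success, [(test_string, parse_results_or_exception), ...]).
success, report = number.runTests("""\
    123
    45
    """, printResults=False)

# failureTests=True inverts the check: the run succeeds only if every input fails.
fail_ok, fail_report = number.runTests("abc", printResults=False,
                                       failureTests=True)

print(success, [parsed.asList() for _, parsed in report])
print(fail_ok, type(fail_report[0][1]).__name__)
```

That makes it straightforward to use the same test strings inside a unit test.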