
Why are pyparsing's `DelimitedList` and `Dict` so awkward to use together?


Pyparsing offers the ParseElementEnhance subclass DelimitedList for parsing (typically comma-separated) lists:

>>> import pyparsing as pp
>>> kv_element = pp.Word(pp.alphanums)
>>> kv_list = pp.DelimitedList(kv_element)
>>> kv_list.parse_string('red, green, blue')
ParseResults(['red', 'green', 'blue'], {})

And it provides the TokenConverter subclass Dict, for transforming a repeating expression into a dictionary:

>>> key = value = pp.Word(pp.alphanums)
>>> kv_pair = key + pp.Suppress("=") + value
>>> kv_dict = pp.Dict(pp.Group(kv_pair)[...])
>>> kv_dict.parse_string('R=red G=green B=blue')
ParseResults([
  ParseResults(['R', 'red'], {}),
  ParseResults(['G', 'green'], {}),
  ParseResults(['B', 'blue'], {})
], {'R': 'red', 'G': 'green', 'B': 'blue'})

But combining them feels awkward. It's possible to build a successful combined ParserElement for parsing a dict out of a delimited list, but compared to the above it requires:

  1. Redefining the DelimitedList to output Group()s
  2. Repeating the DelimitedList when constructing the Dict() around it, to appease the type checker (see note 1).

>>> kv_pair = key + pp.Suppress("=") + value
>>> kv_pairlist = pp.DelimitedList(pp.Group(kv_pair))
>>> kv_pairdict = pp.Dict(kv_pairlist[...])
>>> kv_pairdict.parse_string('R=red, G=green, B=blue')
ParseResults([
  ParseResults(['R', 'red'], {}), 
  ParseResults(['G', 'green'], {}), 
  ParseResults(['B', 'blue'], {})
], {'R': 'red', 'G': 'green', 'B': 'blue'})

The whole effect reads like you're defining a parser to create a dictionary from a series of 1-element delimited lists, each containing a single key-value pair match. (In fact, I'm not entirely sure that isn't what's actually happening in the parser.)

Writing code to express the intent — a parser definition to match a single delimited list, containing a series of key-value pair matches — feels like a struggle against the API. (The fact that using kv_pairdict = pp.Dict(kv_pairlist) will function the same as above, but runs afoul of the type checker, is especially vexing.)

Is there a cleaner way to express the intended parser definition, within the Pyparsing API? If not, is that a deficiency of my design, of Pyparsing's API, or something else?

(Do I have the definition inside out? DelimitedList(Dict(Group(kv_pair)[1, ...])) does also work, but feels even more conceptually backwards to me. But it doesn't involve nearly as much fighting against the API, so maybe I'm just looking at it wrong.)

Notes

  1. (Otherwise, at least in VSCode, it gets this vaguely insane-sounding annotation:)

    No overload variant of "dict" matches argument type "DelimitedList"  mypy(call-overload)

    Possible overload variants:

        def [_KT, _VT] __init__(self) -> dict[_KT, _VT]
        def [_KT, _VT] __init__(self, **kwargs: _VT) -> dict[str, _VT]
        def [_KT, _VT] __init__(self, SupportsKeysAndGetItem[_KT, _VT], /) -> dict[_KT, _VT]
        def [_KT, _VT] __init__(self, SupportsKeysAndGetItem[str, _VT], /, **kwargs: _VT) -> dict[str, _VT]
        def [_KT, _VT] __init__(self, Iterable[tuple[_KT, _VT]], /) -> dict[_KT, _VT]
        def [_KT, _VT] __init__(self, Iterable[tuple[str, _VT]], /, **kwargs: _VT) -> dict[str, _VT]
        def [_KT, _VT] __init__(self, Iterable[list[str]], /) -> dict[str, str]
        def [_KT, _VT] __init__(self, Iterable[list[bytes]], /) -> dict[bytes, bytes]

    mypy(note)
    

Solution

  • Dict is constructed from a single ParserElement that represents a repetition of Group'ed ParserElements. It takes the text matched by the 0'th element of each Group as that Group's key, and the remainder of the Group as the corresponding value. Typically, the repetition is done using OneOrMore or ZeroOrMore (or their newer slice-like notations [1, ...] or [...]). But it is perfectly suitable to use DelimitedList for this repetition, as long as the expression used for the repeated key-value pairs is a Group. See if this slight reworking of your code helps (I really just moved the Group up to the kv_pair definition):

    import pyparsing as pp

    key = pp.common.identifier  # identifier keys stay usable as attribute names
    value = pp.Word(pp.alphanums)
    kv_pair = pp.Group(key + pp.Suppress("=") + value)
    kv_pairlist = pp.DelimitedList(kv_pair)
    kv_pairdict = pp.Dict(kv_pairlist)
    
    
    kv_pairdict.run_tests("""\
        R=red, G=green, B=blue
    """
    )
    

    I saved this as dict_of_delimited_list.py. After adding the following lines, I get dict_of_delimited_list_diagram.html, which contains a railroad diagram of your parser:

    pp.autoname_elements()
    kv_pairdict.create_diagram(f"{__file__.removesuffix('.py')}_diagram.html")
    

    [Image: parser railroad diagram]

    As for your note, I strongly suspect a problem in/with the type checker. The __init__ signature for Dict clearly takes a single ParserElement, not a key type and value type, so I suspect the type checker is seeing "pp.Dict" and thinking "typing.Dict". I confirmed this using a modified version of pyparsing that renames Dict to DictOf, and made no other changes, and your insane-sounding type suggestion was resolved. (Unfortunately, just doing from pyparsing import Dict as DictOf was not sufficient.) As a side note, I'd like to mention that pyparsing's Dict class predates typing.Dict by about 15 years - pyparsing had Dict first!