Setting the maximum number of occurrences with `delimitedList` using pyparsing


pyparsing provides a helper function, delimitedList, that matches a sequence of one or more expressions, separated with a delimiter:

delimitedList(expr, delim=',', combine=False)
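
As a quick illustration of the default behavior, here is a minimal sketch. The `word` expression, the variable names, and the sample strings are made up for illustration only:

```python
import pyparsing as pp

# Hypothetical element expression for illustration: a run of letters.
word = pp.Word(pp.alphas)

# The default delimiter is a comma; delimiters are dropped from the results.
words = pp.delimitedList(word)
print(words.parseString("a, b, c").asList())  # ['a', 'b', 'c']

# A different delimiter can be supplied through the delim argument.
semicolon_words = pp.delimitedList(word, delim=";")
print(semicolon_words.parseString("a; b; c").asList())  # ['a', 'b', 'c']
```

Note that nothing here limits how often a given element may repeat, which is the crux of the question.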

How can this be used to match a sequence of expressions, where each expression may occur zero or one times?

For example, to match "foo", "bar", "baz", I took a bottom-up approach and created a token for each word:

import pyparsing as pp

dbl_quote = pp.Suppress('"')

foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote

I want to create an expression that matches:

zero or one occurrences of "foo", zero or one occurrences of "bar", zero or one occurrences of "baz"

... in any order. Examples of valid input:

  • "foo", "bar", "baz"
  • "baz", "bar", "foo" // Order is unimportant
  • "bar", "baz" // Zero occurrences allowed
  • "baz"
  • // Zero occurrences of all tokens

Examples of invalid input:

  • "notfoo", "notbar", "notbaz"
  • "foo", "foo", "bar", "baz" // Two occurrences of foo
  • "foo" "bar", "baz" // Missing comma
  • "foo", "bar", "baz", // Trailing comma

I gravitated towards delimitedList because my input is a comma-delimited list, but now I feel this function is working against me rather than for me.

import pyparsing as pp

dbl_quote = pp.Suppress('"')

foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote



# This is NOT what I want because it allows tokens
# to occur more than once.
foobarbaz = pp.delimitedList(foo | bar | baz)



if __name__ == "__main__":
    TEST = '"foo", "bar", "baz"'
    results = foobarbaz.parseString(TEST)
    results.pprint()


Solution

  • Ordinarily, when I see "in any order" as part of a grammar, my first thought is to use Each, which you can create with the & operator:

    undelimited_foo_bar_baz = foo & bar & baz
    

    This parser would parse foo, bar, and baz in any order. If you wanted them to be optional, then simply wrap them in Optional:

    undelimited_foo_bar_baz = pp.Optional(foo) & pp.Optional(bar) & pp.Optional(baz)
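
A minimal sketch of how this Each-based parser behaves on whitespace-separated input (no commas, so not yet the input format from the question). The `each_accepts` helper and the sample strings are hypothetical, added only for illustration:

```python
import pyparsing as pp

dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote

# Each item may appear at most once, in any order.
undelimited_foo_bar_baz = pp.Optional(foo) & pp.Optional(bar) & pp.Optional(baz)

def each_accepts(text):
    """Hypothetical helper: True if the whole string parses."""
    try:
        undelimited_foo_bar_baz.parseString(text, parseAll=True)
    except pp.ParseException:
        return False
    return True

print(undelimited_foo_bar_baz.parseString('"baz" "foo"').asList())
print(each_accepts('"bar" "baz" "foo"'))  # any order is accepted
print(each_accepts('"foo" "foo"'))        # a repeated item is rejected
```

Passing `parseAll=True` matters here: without it, the parser would simply stop before the second "foo" instead of reporting an error.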
    

    But the intervening commas in your input make this kind of messy, so as a fallback, you can stick with delimitedList (which strips out the commas) and add a condition parse action, run after the list is parsed, to verify that each matched item occurs no more than once:

    from collections import Counter

    def no_more_than_one_of_any(t):
        # The condition passes only if every matched token occurs exactly once.
        return all(freq == 1 for freq in Counter(t.asList()).values())

    foobarbaz.addCondition(no_more_than_one_of_any, message="duplicate item found in list")
    
    if __name__ == "__main__":
        tests = '''\
        "foo"
        "bar"
        "baz"
        "foo", "baz"
        "foo", "bar", "baz"
        "foo", "bar", "baz", "foo"
        '''
        foobarbaz.runTests(tests)
    

    Prints:

    "foo"
    ['foo']
    
    "bar"
    ['bar']
    
    "baz"
    ['baz']
    
    "foo", "baz"
    ['foo', 'baz']
    
    "foo", "bar", "baz"
    ['foo', 'bar', 'baz']
    
    "foo", "bar", "baz", "foo"
    ^
    FAIL: duplicate item found in list, found '"'  (at char 0), (line:1, col:1)
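
Putting the pieces together, a self-contained sketch that can be checked against the valid and invalid examples from the question might look like this. Two things here are assumptions that go beyond the answer above: the `pp.Optional` wrapper, added because delimitedList needs at least one item while the question treats empty input as valid, and the hypothetical `is_valid` helper, which parses with `parseAll=True` so that leftover text (a missing or trailing comma) counts as a failure:

```python
from collections import Counter

import pyparsing as pp

dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote

def no_more_than_one_of_any(t):
    # The condition passes only if every matched token occurs exactly once.
    return all(freq == 1 for freq in Counter(t.asList()).values())

item_list = pp.delimitedList(foo | bar | baz)
item_list.addCondition(no_more_than_one_of_any, message="duplicate item found in list")

# Assumption: wrap in Optional so that an empty input is accepted too.
foobarbaz = pp.Optional(item_list)

def is_valid(text):
    """Hypothetical helper: True if the entire input matches the grammar."""
    try:
        foobarbaz.parseString(text, parseAll=True)
    except pp.ParseException:
        return False
    return True

if __name__ == "__main__":
    for sample in ['"baz", "bar", "foo"', '', '"foo", "foo", "bar", "baz"',
                   '"foo" "bar", "baz"', '"foo", "bar", "baz",']:
        print(repr(sample), is_valid(sample))
```

One trade-off of the Optional wrapper: when the condition fails, Optional falls back to matching nothing, so the error that surfaces is likely a generic end-of-text failure rather than the "duplicate item found in list" message. If that message matters, parse with `item_list` directly and handle empty input separately.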