pyparsing provides a helper function, delimitedList, that matches a sequence of one or more expressions, separated with a delimiter:
delimitedList(expr, delim=',', combine=False)
How can this be used to match a sequence of expressions, where each expression may occur zero or one times?
For example, to match "foo", "bar", "baz".
I took a bottom-up approach and created a token for each word:
import pyparsing as pp
dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote
I want to create an expression that matches:
zero or one occurrences of "foo",
zero or one occurrences of "bar",
zero or one occurrences of "baz",
... in any order. Examples of valid input:
"foo", "bar", "baz"
"baz", "bar", "foo"  // Order is unimportant
"bar", "baz"  // Zero occurrences allowed
"baz"
// Zero occurrences of all tokens
Examples of invalid input:
"notfoo", "notbar", "notbaz"
"foo", "foo", "bar", "baz"  // Two occurrences of foo
"foo" "bar", "baz"  // Missing comma
"foo", "bar", "baz",  // Trailing comma
I gravitated towards delimitedList because my input is a comma-delimited list, but now I feel this function is working against me rather than for me.
import pyparsing as pp
dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote
# This is NOT what I want because it allows tokens
# to occur more than once.
foobarbaz = pp.delimitedList(foo | bar | baz)
if __name__ == "__main__":
    TEST = '"foo", "bar", "baz"'
    results = foobarbaz.parseString(TEST)
    results.pprint()
Ordinarily, when I see "in any order" as part of a grammar, my first thought is to use Each, which you can create with the & operator:
undelimited_foo_bar_baz = foo & bar & baz
This parser would parse foo, bar, and baz in any order. If you wanted them to be optional, then simply wrap them in Optional:
undelimited_foo_bar_baz = pp.Optional(foo) & pp.Optional(bar) & pp.Optional(baz)
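To see how this behaves on input without commas, here is a small self-contained sketch. It reuses the foo, bar, and baz definitions from the question; the parses helper is a made-up name for illustration only, and parseAll=True is my assumption so that leftover text counts as a failure.

```python
import pyparsing as pp

dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote

# Each (built with &) accepts its operands in any order, each at most once;
# Optional makes every operand skippable.
undelimited_foo_bar_baz = pp.Optional(foo) & pp.Optional(bar) & pp.Optional(baz)


def parses(text):
    """Hypothetical helper: return the matched tokens, or None on failure."""
    try:
        return undelimited_foo_bar_baz.parseString(text, parseAll=True).asList()
    except pp.ParseException:
        return None


print(parses('"baz" "foo"'))  # two of the three words, out of order
print(parses('"foo" "foo"'))  # None: the second "foo" is left unparsed
```

Because each operand can match only once, a repeated word is left over and the parse fails when the whole string must be consumed.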
But the intervening commas in your input make this kind of messy, so as a fallback, you can stick with delimitedList (which will strip out the commas) and add a condition parse action that runs after the list is parsed, to verify that no matched item appears more than once:
from collections import Counter
def no_more_than_one_of_any(t):
    return all(freq == 1 for freq in Counter(t.asList()).values())

foobarbaz.addCondition(no_more_than_one_of_any, message="duplicate item found in list")

if __name__ == "__main__":
    tests = '''\
    "foo"
    "bar"
    "baz"
    "foo", "baz"
    "foo", "bar", "baz"
    "foo", "bar", "baz", "foo"
    '''
    foobarbaz.runTests(tests)
Prints:
"foo"
['foo']
"bar"
['bar']
"baz"
['baz']
"foo", "baz"
['foo', 'baz']
"foo", "bar", "baz"
['foo', 'bar', 'baz']
"foo", "bar", "baz", "foo"
^
FAIL: duplicate item found in list, found '"' (at char 0), (line:1, col:1)
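One gap worth noting: the question lists empty input as valid, but delimitedList needs at least one item. A minimal sketch of one way to cover that case is to wrap the whole list in Optional and require the full string to be consumed. This is an assumption on my part rather than something tested against your real data, and the names maybe_foobarbaz and is_valid are made up for illustration.

```python
from collections import Counter

import pyparsing as pp

dbl_quote = pp.Suppress('"')
foo = dbl_quote + pp.Literal('foo') + dbl_quote
bar = dbl_quote + pp.Literal('bar') + dbl_quote
baz = dbl_quote + pp.Literal('baz') + dbl_quote


def no_more_than_one_of_any(t):
    # Reject the list if any token was matched more than once.
    return all(freq == 1 for freq in Counter(t.asList()).values())


foobarbaz = pp.delimitedList(foo | bar | baz)
foobarbaz.addCondition(no_more_than_one_of_any, message="duplicate item found in list")

# Optional lets the whole list be absent, so empty input is accepted.
maybe_foobarbaz = pp.Optional(foobarbaz)


def is_valid(text):
    """Hypothetical helper: True if the entire string matches the grammar."""
    try:
        maybe_foobarbaz.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


print(is_valid(''))                      # empty input
print(is_valid('"baz", "bar", "foo"'))   # any order
print(is_valid('"foo", "bar", "baz",'))  # trailing comma
```

With parseAll=True, a trailing comma, a missing comma, or an unknown word leaves unparsed text behind, so those inputs are rejected along with duplicates.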