pyparsing: different results from dump() and asXML()

I'm having a problem with pyparsing's parsed results. I have a grammar to parse an expression. Each rule in the grammar calls setResultsName() so I can easily manipulate the parsed results. But when I use dump() to see how the result is organized, it does not show all the parsed results. When I use asXML(), however, all the results are there and structured the way I want.

Here is the grammar:

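from pyparsing import (CaselessLiteral, Combine, Forward, Group, Keyword,
                       Literal, Optional, Word, ZeroOrMore, alphas,
                       delimitedList, nums)

# Rule for any alphanumeric word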
identifier = Word(alphas, alphas + nums)

# Rule for "e" in floating point numbers
e = CaselessLiteral('E')

# Rule for booleans
# (compare against 'True': bool('False') is True, since any non-empty string is truthy)
boolean = (Keyword('True')
           | Keyword('False')
).setParseAction(lambda tokens: tokens[0] == 'True').setResultsName("boolean")

# Rule for integer numbers
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

# Rule for factor operator
factor_operator = (Literal('*') 
                   | Literal('/') 
                   | Literal('%')
).setResultsName("operator")

# Rule for term operator
term_operator = (Literal('+') 
                 | Literal('-')
).setResultsName("operator")

# Rule for double numbers
double = Combine(integer +
                 Optional(Literal('.') + Optional(integer)) +
                 Optional(e + Optional(term_operator) + integer)
).setParseAction(lambda tokens: float(tokens[0])).setResultsName("double")

# Forwarding expression rule
expression = Forward()

# Rule define type of factor
factor = Group((
          Literal('(').suppress() + 
              expression.setResultsName("expression") +
          Literal(')').suppress())
          | double 
          | boolean
).setResultsName("factor")

# Rule for factors
factors = Group(ZeroOrMore(factor_operator + factor)).setResultsName("factors")

# Rule for term
term = Forward()
term << Group(factor + delimitedList(factors)).setResultsName("term")

# Rule for terms
terms = Group(ZeroOrMore(term_operator + term)).setResultsName("terms")

# Rule for expression
expression << Group(Optional(term_operator) + term + delimitedList(terms)
).setResultsName("expression")

return expression

Here is the expression I want to parse:

"(2 * 3) + 20 / 5 - 1"

Here is the output from dump():

[[[[[[[2.0], ['*', [3.0]]], []]], []], ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]]]
- expression: [[[[[[2.0], ['*', [3.0]]], []]], []], ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]]
  - term: [[[[[2.0], ['*', [3.0]]], []]], []]
    - factor: [[[[2.0], ['*', [3.0]]], []]]
      - expression: [[[2.0], ['*', [3.0]]], []]
        - term: [[2.0], ['*', [3.0]]]
          - factor: [2.0]
            - double: 2.0
          - factors: ['*', [3.0]]
            - factor: [3.0]
              - double: 3.0
            - operator: *
        - terms: []
    - factors: []
  - terms: ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]
    - operator: -
    - term: [[1.0], []]
      - factor: [1.0]
        - double: 1.0
      - factors: []

And the output from asXML():

<expression>
  <expression>
    <term>
      <factor>
        <double>2.0</double>
      </factor>
      <factors>
        <operator>*</operator>
        <factor>
          <double>3.0</double>
        </factor>
      </factors>
    </term>
    <terms>
      <operator>-</operator>
      <term>
        <factor>
          <double>20.0</double>
        </factor>
        <factors>
          <operator>/</operator>
          <factor>
            <double>5.0</double>
          </factor>
        </factors>
      </term>
      <operator>+</operator>
      <term>
        <factor>
          <double>1.0</double>
        </factor>
        <factors>
        </factors>
      </term>
    </terms>
  </expression>
</expression>

The problem is in the terms tag that follows the nested, parenthesized expression. The XML displays all the terms in it (i.e., '+', '20.0 / 5.0', '-', '1.0'), which is what I expect: a list of operators and terms. The dump() output displays only the last operator and term (i.e., '-', '1.0'). Can anyone help me understand this? Is there something I am missing? Sorry if I have left out anything needed to make this clear.

Solution

If there is a difference between dump() and asXML(), I would more likely see it as a bug in asXML(). That method is forced to do a fair bit of "guessing" as to what is wanted, and I can easily see it guessing wrong in some circumstances.

pyparsing's default behavior is to return all parsed tokens as a flat list of strings. It does this regardless of how a parser was built up. This is so that

(A + B + C).parseString

and

AA = A + B
(AA + C).parseString

and

DD = B + C
(A + DD).parseString

all return the same thing.
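
To make the flattening concrete, here is a minimal sketch. The answer leaves A, B and C undefined, so the definitions below (a word, a number, a word) and the sample input are assumptions chosen only for illustration.

```python
from pyparsing import Word, alphas, nums

# Placeholder expressions (assumed for illustration; the answer does not define them)
A = Word(alphas)
B = Word(nums)
C = Word(alphas)

sample = "abc 123 def"

# One expression built in a single step
flat = (A + B + C).parseString(sample).asList()

# The same grammar built from an intermediate left-hand piece
AA = A + B
left_nested = (AA + C).parseString(sample).asList()

# The same grammar built from an intermediate right-hand piece
DD = B + C
right_nested = (A + DD).parseString(sample).asList()

print(flat)  # ['abc', '123', 'def']
print(flat == left_nested == right_nested)  # True
```

How the parser was assembled leaves no trace in the returned tokens unless you ask for structure explicitly.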

Let's look at a simple grammar, a series of name/age pairs:

test = "Bob 10 Sue 12 Henry 7"

And here is our parser:

name = Word(alphas)
integer = Word(nums)

parser = OneOrMore(name + integer)

# or you can use the new multiplication syntax
parser = (name + integer) * (1,)

With the given sample text and the above parser, the parsed result would be:

['Bob', '10', 'Sue', '12', 'Henry', '7']

Now this isn't too hard to walk through, reading items two at a time. But if there were additional, optional fields, then things get trickier. So it is much easier to tell pyparsing that each person's name and age should be grouped together.

parser = OneOrMore(Group(name + integer))

Now we get a sublist for each person, and there is no guessing in case there might be additional options.

[['Bob', '10'], ['Sue', '12'], ['Henry', '7']]
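
The point about optional fields can be sketched as follows. The optional second number (say, a height) is a hypothetical field invented for this example; it is not part of the original sample data.

```python
from pyparsing import Group, OneOrMore, Optional, Word, alphas, nums

# Sample text with a hypothetical optional second number after Sue's age
test_with_extra = "Bob 10 Sue 12 140 Henry 7"

name = Word(alphas)
integer = Word(nums)

# A person is a name, an age, and optionally one more number
person = name + integer + Optional(integer)

# Ungrouped: the tokens can no longer be read two at a time
flat_tokens = OneOrMore(person).parseString(test_with_extra).asList()

# Grouped: each person gets a sublist, whatever its length
grouped_tokens = OneOrMore(Group(person)).parseString(test_with_extra).asList()

print(flat_tokens)
print(grouped_tokens)
```

With grouping, the consumer of the results does not need to know which fields were present in order to find where one person ends and the next begins.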

If you add results names to the original ungrouped parser, we see this (I'm using the "new" callable syntax instead of the wordy and distracting "setResultsName" call format):

parser = OneOrMore(name("name") + integer("age"))
result = parser.parseString(test)

Knowing what we know now about ungrouped results, if we ask for result.name, which name should we get?

If you have a situation where there are multiple expressions that share the same results name, then you have 3 options:

keep only the last one parsed (the default, which is what you are seeing)
add grouping using the Group class so that the multiple shared results will be separated into different sub-structures
add the listAllItems=True argument to setResultsName()

parser = (OneOrMore(name.setResultsName("name", listAllItems=True)
          + integer.setResultsName("age", listAllItems=True)))

or if using the abbreviated callable format, add '*' to the end of the results name:

parser = OneOrMore(name("name*") + integer("age*"))

Now result.name will give you all of the parsed names in a list, and the result.age will give you the corresponding ages. But for data like this, I would prefer to see the data parsed into groups.
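
A short sketch comparing the default behavior with the trailing '*' form, using the sample text from above. The variable names last_only and keep_all are mine, introduced only for this example.

```python
from pyparsing import OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"

name = Word(alphas)
integer = Word(nums)

# Default: each repetition overwrites the previous value stored under the name
last_only = OneOrMore(name("name") + integer("age")).parseString(test)
print(last_only.name, last_only.age)  # Henry 7

# Trailing '*': every match is kept under the name
keep_all = OneOrMore(name("name*") + integer("age*")).parseString(test)
print(list(keep_all.name))  # ['Bob', 'Sue', 'Henry']
print(list(keep_all.age))   # ['10', '12', '7']
```

Note that the two lists are only related by position, which is one reason grouping tends to be easier to work with for record-like data.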

parser = OneOrMore(Group(name("name") + integer("age")))

If you want asXML() to tag each group with the tag "person", then add that name to the Group, with a trailing '*' to catch them all.

parser = OneOrMore(Group(name("name") + integer("age"))("person*"))
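
As a runnable sketch of the grouped, named version: inside each group the names "name" and "age" are set only once, so nothing gets overwritten. The loop and the pairs list are my own additions for illustration.

```python
from pyparsing import Group, OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"

name = Word(alphas)
integer = Word(nums)

# Each person is its own group; "person*" keeps every group under that name
parser = OneOrMore(Group(name("name") + integer("age"))("person*"))
result = parser.parseString(test)

# Every group carries its own "name" and "age"
pairs = [(person.name, person.age) for person in result]
for person_name, age in pairs:
    print(person_name, age)

# dump() shows the nested structure together with the results names
print(result.dump())
```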

(This is already a long-winded answer, so I have left out the dump() and asXML() output from these tests - left as an exercise for the OP and future readers.)
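
Applied to the grammar in the question, the same reasoning suggests why dump() shows only '-' and 1.0 under "terms": every repetition inside ZeroOrMore(term_operator + term) writes to the same "operator" and "term" names. Below is a minimal sketch with a deliberately simplified grammar. Here number stands in for the full term rule, and pair is a hypothetical helper; neither is the asker's actual code.

```python
from pyparsing import Group, Literal, Word, ZeroOrMore, nums

number = Word(nums)  # simplified stand-in for the question's "term" rule
term_operator = Literal("+") | Literal("-")

# As in the question: all repetitions share the names "operator" and "term"
overwriting = Group(ZeroOrMore(term_operator("operator") + number("term")))("terms")
r1 = overwriting.parseString("+ 20 - 1")
print(r1.terms.asList())                 # ['+', '20', '-', '1']
print(r1.terms.operator, r1.terms.term)  # - 1

# One possible fix: group each operator/term pair so its names live in its own sub-structure
pair = Group(term_operator("operator") + number("term"))
separated = Group(ZeroOrMore(pair))("terms")
r2 = separated.parseString("+ 20 - 1")
named_pairs = [(p.operator, p.term) for p in r2.terms]
print(named_pairs)                       # [('+', '20'), ('-', '1')]
```

The other option from the list above, a trailing '*' on the shared names, would also keep every match, at the cost of having the operators and terms in two separate lists.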