
pyparsing different results when using dump() and asXML() function


I'm having a problem with pyparsing's parsed results. I have a grammar that parses an expression, and every rule in the grammar calls setResultsName() so that I can easily manipulate the parsed results. But when I use dump() to see how the result is organized, it does not show all of the parsed results. When I use asXML(), however, all the results are there and structured the way I want.

Here is the grammar:

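from pyparsing import (CaselessLiteral, Combine, Forward, Group, Keyword,
                       Literal, Optional, Word, ZeroOrMore, alphas,
                       delimitedList, nums)

# Rule for any alphanumeric word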
identifier = Word(alphas, alphas + nums)

# Rule for "e" in floating point numbers
e = CaselessLiteral('E')

# Rule for booleans
# (bool() of any non-empty string is True, so compare against 'True' instead)
boolean = (Keyword('True') 
           | Keyword('False')
).setParseAction(lambda tokens: tokens[0] == 'True').setResultsName("boolean")

# Rule for integer numbers
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

# Rule for factor operator
factor_operator = (Literal('*') 
                   | Literal('/') 
                   | Literal('%')
).setResultsName("operator")

# Rule for term operator
term_operator = (Literal('+') 
                 | Literal('-')
).setResultsName("operator")

# Rule for double numbers
double = Combine(integer +
                 Optional(Literal('.') + Optional(integer)) +
                 Optional(e + Optional(term_operator) + integer)
).setParseAction(lambda tokens: float(tokens[0])).setResultsName("double")

# Forwarding expression rule
expression = Forward()

# Rule define type of factor
factor = Group((
          Literal('(').suppress() + 
              expression.setResultsName("expression") +
          Literal(')').suppress())
          | double 
          | boolean
).setResultsName("factor")

# Rule for factors
factors = Group(ZeroOrMore(factor_operator + factor)).setResultsName("factors")

# Rule for term
term = Forward()
term << Group(factor + delimitedList(factors)).setResultsName("term")

# Rule for terms
terms = Group(ZeroOrMore(term_operator + term)).setResultsName("terms")

# Rule for expression
expression << Group(Optional(term_operator) + term + delimitedList(terms)
).setResultsName("expression")

return expression

Here is the expression I want to parse:

"(2 * 3) + 20 / 5 - 1"

Here is the output from dump():

[[[[[[[2.0], ['*', [3.0]]], []]], []], ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]]]
- expression: [[[[[[2.0], ['*', [3.0]]], []]], []], ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]]
  - term: [[[[[2.0], ['*', [3.0]]], []]], []]
    - factor: [[[[2.0], ['*', [3.0]]], []]]
      - expression: [[[2.0], ['*', [3.0]]], []]
        - term: [[2.0], ['*', [3.0]]]
          - factor: [2.0]
            - double: 2.0
          - factors: ['*', [3.0]]
            - factor: [3.0]
              - double: 3.0
            - operator: *
        - terms: []
    - factors: []
  - terms: ['+', [[20.0], ['/', [5.0]]], '-', [[1.0], []]]
    - operator: -
    - term: [[1.0], []]
      - factor: [1.0]
        - double: 1.0
      - factors: []

And the output from asXML():

<expression>
  <expression>
    <term>
      <factor>
        <double>2.0</double>
      </factor>
      <factors>
        <operator>*</operator>
        <factor>
          <double>3.0</double>
        </factor>
      </factors>
    </term>
    <terms>
      <operator>-</operator>
      <term>
        <factor>
          <double>20.0</double>
        </factor>
        <factors>
          <operator>/</operator>
          <factor>
            <double>5.0</double>
          </factor>
        </factors>
      </term>
      <operator>+</operator>
      <term>
        <factor>
          <double>1.0</double>
        </factor>
        <factors>
        </factors>
      </term>
    </terms>
  </expression>
</expression>

The problem is with the terms tag that follows the parenthesized nested expression. The XML shows everything it contains (i.e., '+', '20.0 / 5.0', '-', '1.0'), which is what I expect: a list of operators and terms. dump(), however, shows only the last operator and term (i.e., '-', '1.0'). Can anyone help me understand this? Is there something I am missing? Apologies if anything is unclear.


Solution

  • If there is a difference between dump() and asXML(), I would more likely see it as a bug in asXML(). That method is forced to do a fair bit of "guessing" as to what is wanted, and I can easily see it guessing wrong in some circumstances.

    pyparsing's default behavior is to return all parsed tokens as a flat list of strings. It does this regardless of how a parser was built up. This is so that

    (A + B + C).parseString
    
    AA = A + B
    (AA + C).parseString
    
    and
    
    DD = B + C
    (A + DD).parseString
    

    all return the same thing.
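
    As a quick check of this claim, here is a minimal sketch. A, B and C are assumed to be simple literals purely for illustration; any three sub-expressions should behave the same way.

```python
from pyparsing import Literal

# A, B and C are stand-ins chosen for this sketch, not part of the question.
A, B, C = Literal("a"), Literal("b"), Literal("c")

flat = (A + B + C).parseString("a b c").asList()

AA = A + B
left_nested = (AA + C).parseString("a b c").asList()

DD = B + C
right_nested = (A + DD).parseString("a b c").asList()

# However the parser was composed, the tokens come back as one flat list.
print(flat, left_nested, right_nested)
```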

    Let's look at a simple grammar, a series of name/age pairs:

    test = "Bob 10 Sue 12 Henry 7"
    

    And here is our parser:

    name = Word(alphas)
    integer = Word(nums)
    
    parser = OneOrMore(name + integer)
    
    # or you can use the new multiplication syntax
    parser = (name + integer) * (1,)
    

    With the given sample text and the above parser, the result would be:

    ['Bob', '10', 'Sue', '12', 'Henry', '7']
    

    Now this isn't too hard to walk through, reading items two at a time. But if there were additional, optional fields, then things get trickier. So it is much easier to tell pyparsing that each person's name and age should be grouped together.

    parser = OneOrMore(Group(name + integer))
    

    Now we get a sublist for each person, and there is no guessing in case there might be additional options.

    [['Bob', '10'], ['Sue', '12'], ['Henry', '7']]
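
    To make the difference concrete, the sketch below runs both parsers on the sample text. The variable names (flat, grouped and the two pairs_* lists) are only illustrative.

```python
from pyparsing import Group, OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"
name = Word(alphas)
integer = Word(nums)

flat = OneOrMore(name + integer).parseString(test).asList()
grouped = OneOrMore(Group(name + integer)).parseString(test).asList()

# Flat results have to be walked two items at a time...
pairs_from_flat = list(zip(flat[0::2], flat[1::2]))

# ...while grouped results unpack one person per sublist.
pairs_from_grouped = [(person_name, age) for person_name, age in grouped]

print(pairs_from_flat == pairs_from_grouped)
```

    The slicing trick only works while every record has exactly two fields; the grouped form keeps working if optional fields are added later.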
    

    If you add results names to the original ungrouped parser, we see this (I'm using the "new" callable syntax instead of the wordy and distracting "setResultsName" call format):

    parser = OneOrMore(name("name") + integer("age"))
    result = parser.parseString(test)
    

    Knowing what we know now about ungrouped results, if we ask for result.name, which name should we get?

    If you have a situation where there are multiple expressions that share the same results name, then you have 3 options:

    • keep only the last one parsed (the default, which is what you are seeing)

    • add grouping using the Group class so that the multiple shared results will be separated into different sub-structures

    • add the listAllMatches=True argument to setResultsName()

      parser = (OneOrMore(name.setResultsName("name", listAllMatches=True)
                + integer.setResultsName("age", listAllMatches=True)))
      

      or if using the abbreviated callable format, add '*' to the end of the results name:

      parser = OneOrMore(name("name*") + integer("age*"))
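
    The first and third options can be compared side by side. This is a small sketch using the sample text from above; the keyword argument on setResultsName() is listAllMatches, and the variable names are made up for the example.

```python
from pyparsing import OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"
name = Word(alphas)
integer = Word(nums)

# Default: each new match overwrites the previous value stored under the name.
last_only = OneOrMore(name("name") + integer("age")).parseString(test)

# Trailing '*': every match is accumulated under the name.
all_matches = OneOrMore(name("name*") + integer("age*")).parseString(test)

# The explicit keyword form is equivalent to the trailing '*'.
explicit = OneOrMore(
    name.setResultsName("name", listAllMatches=True)
    + integer.setResultsName("age", listAllMatches=True)
).parseString(test)

print(last_only.name)
print(all_matches.name.asList())
print(explicit.age.asList())
```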
      

    Now result.name will give you all of the parsed names in a list, and the result.age will give you the corresponding ages. But for data like this, I would prefer to see the data parsed into groups.

    parser = OneOrMore(Group(name("name") + integer("age")))
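
    With this grouped parser, each sublist carries its own name and age, so nothing gets overwritten. A minimal sketch of reading them back (the people list is just an illustrative name):

```python
from pyparsing import Group, OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"
name = Word(alphas)
integer = Word(nums)

parser = OneOrMore(Group(name("name") + integer("age")))
result = parser.parseString(test)

# Each item in the result is a group with its own results names.
people = [(person.name, int(person.age)) for person in result]
print(people)
```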
    

    If you want asXML() to tag each group with the tag "person", then add that name to the Group, with a trailing '*' to catch them all.

    parser = OneOrMore(Group(name("name") + integer("age"))("person*"))
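
    As a rough sketch of how the named groups can then be used: with the trailing '*', result.person should collect every group, and dump() lists them all. The code sticks to dump() because asXML() is not available in newer pyparsing releases.

```python
from pyparsing import Group, OneOrMore, Word, alphas, nums

test = "Bob 10 Sue 12 Henry 7"
name = Word(alphas)
integer = Word(nums)

parser = OneOrMore(Group(name("name") + integer("age"))("person*"))
result = parser.parseString(test)

# result.person holds every group, not just the last one parsed.
names = [person.name for person in result.person]
print(names)

# dump() shows the nested structure together with the results names.
print(result.dump())
```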
    

    (This is already a long-winded answer, so I have left out the dump() and asXML() output from these tests - left as an exercise for the OP and future readers.)