
Wrong tags in PyParsing when using setResultsName() and asXML()


I want to use PyParsing to parse text and output the result as XML with asXML(). But the tags in the XML output are inconsistent with the names I set via setResultsName().

Please see the following code segments:

from pyparsing import Literal

p1 = (Literal('a').setResultsName('tag_a')).setResultsName('tag_out')
print(p1.parseString('a').asXML())
# Output:
# <tag_out>
#   <tag_out>a</tag_out>
# </tag_out>

p2 = (Literal('a').setResultsName('tag_a') +
      Literal('b').setResultsName('tag_b')).setResultsName('tag_out')
print(p2.parseString('a b').asXML())

# The output varies between runs; it is one of the following two:
# <tag_out>
#   <tag_a>a</tag_a>
#   <tag_b>b</tag_b>
# </tag_out>
#
# <tag_out>
#   <tag_out>a</tag_out>
#   <tag_b>b</tag_b>
# </tag_out>

Note that the tag of the first inner element is often wrong.

Is this a known bug in PyParsing? Is there a patch or workaround for it?


Solution

  • Pyparsing does not automatically impart structure to the expressions in your grammar based on how your code is written. This is by design: it lets partial expressions be merged together freely, so that:

    grammar = exprA + exprB + exprC
    

    and

    grammar = exprA + (exprB + exprC)
    

    and

    tmp = exprA + exprB
    grammar = tmp + exprC
    

    all behave the same way. So just putting parentheses around an expression does not automatically define another level in your grammar.
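    To make this concrete, here is a minimal sketch (the names a, b, c, flat, nested, staged, and renamed are illustrative, not from the question). It checks that all three spellings produce the same flat token list. It also shows why the first example in the question only ever prints tag_out: setResultsName() returns a renamed copy, so the second call replaces the first name rather than wrapping it. The camelCase method names match the ones used in this thread.

```python
from pyparsing import Literal

a, b, c = Literal('a'), Literal('b'), Literal('c')

# Three ways of composing the same sequence of expressions
flat = a + b + c
nested = a + (b + c)
tmp = a + b
staged = tmp + c

# Parentheses and intermediate variables add no nesting to the results
for grammar in (flat, nested, staged):
    print(grammar.parseString('a b c').asList())  # ['a', 'b', 'c']

# Chaining setResultsName() on a single expression just renames it:
# only the last name survives, there is no outer/inner level
renamed = Literal('a').setResultsName('tag_a').setResultsName('tag_out')
result = renamed.parseString('a')
print('tag_a' in result, 'tag_out' in result)
```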

    Pyparsing provides the Group class for what you want. Your results will improve greatly if you change your code to:

    from pyparsing import Group, Literal

    p1 = Group(Literal('a').setResultsName('tag_a')).setResultsName('tag_out')
    print(p1.parseString('a').asXML())
    
    p2 = Group(Literal('a').setResultsName('tag_a') +
               Literal('b').setResultsName('tag_b')).setResultsName('tag_out')
    print(p2.parseString('a b').asXML())
    

    That being said, asXML() is not the greatest part of pyparsing, as it has to make guesses under certain circumstances when walking a structure and creating the output tags. If this is just for debugging purposes, I recommend using the dump() method instead.
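    As a rough illustration of the dump() suggestion, the sketch below parses the grouped grammar and inspects the result without asXML() (which, as far as I know, was removed in pyparsing 3.0, so avoiding it also keeps the example runnable on newer releases). The variable name result is my own. The exact layout of the dump() text can differ between pyparsing versions, so no expected output is shown for it.

```python
from pyparsing import Group, Literal

# Group() creates the extra level that the parentheses alone did not
p2 = Group(Literal('a').setResultsName('tag_a') +
           Literal('b').setResultsName('tag_b')).setResultsName('tag_out')

result = p2.parseString('a b')

# dump() lists the tokens followed by an indented tree of the named results
print(result.dump())

# The names nest the way the grammar reads: tag_out contains tag_a and tag_b
print(result.tag_out.tag_a)  # a
print(result.tag_out.tag_b)  # b
print(result.asList())       # [['a', 'b']]
```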

    I also recommend that you switch to the implicit callable form of setResultsName - I think it streamlines your grammar code without hurting readability. See the difference:

    p1 = Group(Literal('a')('tag_a'))('tag_out')
    print(p1.parseString('a').asXML())
    
    p2 = Group(Literal('a')('tag_a') + Literal('b')('tag_b'))('tag_out')
    print(p2.parseString('a b').asXML())
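    If you want to convince yourself that the two spellings are interchangeable, a small check along these lines should do (the names verbose and concise are illustrative). It compares the named results through asDict(), which returns a nested dict that can also serve as a more predictable starting point than asXML() if you need to build XML yourself.

```python
from pyparsing import Group, Literal

# Explicit setResultsName() calls
verbose = Group(Literal('a').setResultsName('tag_a') +
                Literal('b').setResultsName('tag_b')).setResultsName('tag_out')

# Same grammar using the callable shorthand expr('name')
concise = Group(Literal('a')('tag_a') + Literal('b')('tag_b'))('tag_out')

verbose_names = verbose.parseString('a b').asDict()
concise_names = concise.parseString('a b').asDict()

# Both grammars produce the same nested mapping of names to tokens
print(verbose_names)
print(concise_names)
print(verbose_names == concise_names)
```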