Pyparsing finding more matches than expected

I'm writing code to parse lines of basic computer instructions. My input string looks something like this: ADD(input1,input2) DEL(input3), SUB(input1,input2) INS(input3)

and I'm expecting a result like:

<line>
  <instruction>
    <type>ADD</type>
    <args>
      <ITEM>input1</ITEM>
      <ITEM>input2</ITEM>
    </args>
  </instruction>
  <instruction>
    <type>DEL</type>
    <args>
      <ITEM>input3</ITEM>
    </args>
  </instruction>
</line>
<line>
  <instruction>
    <type>SUB</type>
    <args>
      <ITEM>input1</ITEM>
      <ITEM>input2</ITEM>
    </args>
  </instruction>
  <instruction>
    <type>INS</type>
    <args>
      <ITEM>input3</ITEM>
    </args>
  </instruction>
</line>

My actual result has the general structure I'm looking for, but the line and instruction parsers seem to be matching in the wrong place, or perhaps the labels are being attached in the wrong place.

Actual Result:

<line>
  <line>
    <instruction>
      <type>ADD</type>
      <args>
        <ITEM>input1</ITEM>
        <ITEM>input2</ITEM>
      </args>
    </instruction>
    <instruction>
      <type>DEL</type>
      <args>
        <ITEM>input3</ITEM>
      </args>
    </instruction>
  </line>
  <instruction>
    <instruction>
      <type>SUB</type>
      <args>
        <ITEM>input1</ITEM>
        <ITEM>input2</ITEM>
      </args>
    </instruction>
    <instruction>
      <type>INS</type>
      <args>
        <ITEM>input3</ITEM>
      </args>
    </instruction>
  </instruction>
</line>

Dump of Results

[[['OTE', ['output1']]], [['XIO', ['input2']], ['OTE', ['output2']]]]
- branch: [[['OTE', ['output1']]], [['XIO', ['input2']], ['OTE', ['output2']]]]
  [0]:
    [['OTE', ['output1']]]
    - instruction: ['OTE', ['output1']]
      - args: ['output1']
      - type: 'OTE'
  [1]:
    [['XIO', ['input2']], ['OTE', ['output2']]]
    - instruction: ['OTE', ['output2']]
      - args: ['output2']
      - type: 'OTE'

For some reason, line is matching over the entire structure, and the second line of instructions is matching as a single instruction group. I've tried calling .setDebug() on the instruction expression, but I'm not sure how to interpret the output. I don't see why the last line should match as an instruction, because it doesn't follow the Word(Word) pattern.

My Code:

#!python3
from pyparsing import nestedExpr,alphas,Word,Literal,OneOrMore,alphanums,delimitedList,Group,Forward

theInput = r"ADD(input1,input2) DEL(input3), SUB(input1,input2) INS(input3)"

instructionType = Word(alphanums+"_")("type")
argument = Word(alphanums+"_[].")
arguments = Group(delimitedList(argument))("args")
instruction = Group(instructionType + Literal("(").suppress() + arguments + Literal(")").suppress())("instruction")

line = (delimitedList(Group(OneOrMore(instruction))))("line")

parsedInput = line.parseString(theInput).asXML()
print(parsedInput)

Debug Output:

Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 0(1,1)
Matched Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) -> [['ADD', ['input1', 'input2']]]
Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 18(1,19)
Matched Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) -> [['DEL', ['input3']]]
Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 30(1,31)
Exception raised:Expected W:(ABCD...) (at char 30), (line:1, col:31)
Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 32(1,33)
Matched Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) -> [['SUB', ['input1', 'input2']]]
Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 50(1,51)
Matched Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) -> [['INS', ['input3']]]
Match Group:({W:(ABCD...) Suppress:("(") Group:(W:(ABCD...) [, W:(ABCD...)]...) Suppress:(")")}) at loc 62(1,63)
Exception raised:Expected W:(ABCD...) (at char 62), (line:1, col:63)

What am I doing wrong?

Solution

Your dump output for the code you posted looks like this:

ADD(input1,input2) DEL(input3), SUB(input1,input2) INS(input3)

[[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
- line: [[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
  [0]:
    [['ADD', ['input1', 'input2']], ['DEL', ['input3']]]
    - instruction: ['DEL', ['input3']]
      - args: ['input3']
      - type: 'DEL'
  [1]:
    [['SUB', ['input1', 'input2']], ['INS', ['input3']]]
    - instruction: ['INS', ['input3']]
      - args: ['input3']
      - type: 'INS'

We can see in the dump() output that all the instructions are parsed, but only the last instruction in each group shows up under the "instruction" name. This happens because, like a Python dict, when multiple values (as you might get in a ZeroOrMore or OneOrMore) get assigned with the same key, only the last value is retained.
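As a rough analogy in plain Python (no pyparsing involved), the sketch below shows why only the last value survives under a repeated key, and how collecting values in a list keeps all of them. The names `matches`, `last_only`, and `all_values` are made up for this illustration.

```python
# Two "matches" that share the same results name, in parse order.
matches = [("instruction", "ADD"), ("instruction", "DEL")]

# Plain dict assignment: each new value overwrites the previous one.
last_only = {}
for name, value in matches:
    last_only[name] = value

# Collecting into a list per key keeps every value instead.
all_values = {}
for name, value in matches:
    all_values.setdefault(name, []).append(value)

print(last_only)   # only 'DEL' remains under "instruction"
print(all_values)  # both 'ADD' and 'DEL' are kept
```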

There are two solutions. One is to remove the ("instruction") results name so that you just get the parsed instructions in each sublist:

[[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
- line: [[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
  [0]:
    [['ADD', ['input1', 'input2']], ['DEL', ['input3']]]
    [0]:
      ['ADD', ['input1', 'input2']]
      - args: ['input1', 'input2']
      - type: 'ADD'
    [1]:
      ['DEL', ['input3']]
      - args: ['input3']
      - type: 'DEL'
  [1]:
    [['SUB', ['input1', 'input2']], ['INS', ['input3']]]
    [0]:
      ['SUB', ['input1', 'input2']]
      - args: ['input1', 'input2']
      - type: 'SUB'
    [1]:
      ['INS', ['input3']]
      - args: ['input3']
      - type: 'INS'
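A minimal sketch of this first approach, assuming the same grammar as in the question with only the ("instruction") name dropped. Suppress("(") stands in for Literal("(").suppress(), and the `pairs` list is just a hypothetical way of walking the nested result.

```python
from pyparsing import Word, alphanums, delimitedList, Group, OneOrMore, Suppress

theInput = "ADD(input1,input2) DEL(input3), SUB(input1,input2) INS(input3)"

instructionType = Word(alphanums + "_")("type")
argument = Word(alphanums + "_[].")
arguments = Group(delimitedList(argument))("args")
# No results name on the instruction group: each instruction is reached by position.
instruction = Group(instructionType + Suppress("(") + arguments + Suppress(")"))

line = delimitedList(Group(OneOrMore(instruction)))("line")
result = line.parseString(theInput)

# Walk each comma-separated group, then each instruction inside it.
pairs = []
for group in result:
    for instr in group:
        pairs.append((instr["type"], instr["args"].asList()))

print(pairs)
```

Each instruction still carries its own "type" and "args" names, so nothing is lost by indexing the instructions positionally.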

The other is to tell pyparsing to keep every value saved under a given name, which is sometimes what you want. The setResultsName() method has an optional argument, listAllMatches, which enables this behavior. When using the callable shortcut for setResultsName, you can't pass listAllMatches=True; instead, end your results name with '*':

instruction = Group(instructionType
                    + Literal("(").suppress()
                    + arguments
                    + Literal(")").suppress())("instruction*")

Which gives this output:

[[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
- line: [[['ADD', ['input1', 'input2']], ['DEL', ['input3']]], [['SUB', ['input1', 'input2']], ['INS', ['input3']]]]
  [0]:
    [['ADD', ['input1', 'input2']], ['DEL', ['input3']]]
    - instruction: [['ADD', ['input1', 'input2']], ['DEL', ['input3']]]
      [0]:
        ['ADD', ['input1', 'input2']]
        - args: ['input1', 'input2']
        - type: 'ADD'
      [1]:
        ['DEL', ['input3']]
        - args: ['input3']
        - type: 'DEL'
  [1]:
    [['SUB', ['input1', 'input2']], ['INS', ['input3']]]
    - instruction: [['SUB', ['input1', 'input2']], ['INS', ['input3']]]
      [0]:
        ['SUB', ['input1', 'input2']]
        - args: ['input1', 'input2']
        - type: 'SUB'
      [1]:
        ['INS', ['input3']]
        - args: ['input3']
        - type: 'INS'
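For comparison, here is a sketch of the same grammar using the explicit setResultsName(..., listAllMatches=True) call rather than the '*' shortcut. It assumes the rest of the grammar is unchanged from the question; `types_per_line` is a hypothetical helper variable for reading the results back.

```python
from pyparsing import Word, alphanums, delimitedList, Group, OneOrMore, Suppress

theInput = "ADD(input1,input2) DEL(input3), SUB(input1,input2) INS(input3)"

instructionType = Word(alphanums + "_")("type")
argument = Word(alphanums + "_[].")
arguments = Group(delimitedList(argument))("args")
# listAllMatches=True keeps every match under "instruction", not just the last.
instruction = Group(
    instructionType + Suppress("(") + arguments + Suppress(")")
).setResultsName("instruction", listAllMatches=True)

line = delimitedList(Group(OneOrMore(instruction)))("line")
result = line.parseString(theInput)

# Each group now exposes all of its instructions by name.
types_per_line = [
    [instr["type"] for instr in group["instruction"]]
    for group in result
]

print(types_per_line)
```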

You can choose which approach you are more comfortable with.