Dealing with ZeroOrMore in pyparsing


I'm trying to parse the output of pactl list with pyparsing. So far everything parses correctly, except that I cannot get ZeroOrMore to behave as I expect.

A line can be either foo: (no value) or foo: bar (with a value), and I tried to handle both with ZeroOrMore, but it doesn't work. As a workaround I added a special case for "Argument:" to match the entry without a value. That is not a real fix: there are also Argument: foo entries (with a value), and I expect that any other property may appear without a value too.

With this definition, and a fixed pactl list output:

#!/usr/bin/env python

#
# parsing pactl list
#

from pyparsing import *
import os
from subprocess import check_output
import sys

data = '''
Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
'''

indentStack = [1]
stmt = Forward()

identifier = Word(alphanums+"-_.")

sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)

value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1)))))
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=")  + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)

stmt << ( section | prop | ("Argument:") | value | prop_val )

syntax = OneOrMore(stmt)

parseTree = syntax.parseString(data)
parseTree.pprint()

This gets:

$ ./pactl.py

Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
[[['Module'], ['6']],
 [['Argument:'],
  [[['Name'], ['module-alsa-card']]],
  [[['Usage counter'], ['0']]],
  ['Properties:',
   [[[['module.author'], ['"Lennart Poettering"']]],
    [[['module.description'], ['"ALSA Card"']]],
    [[['module.version'], ['"14.0-rebootstrapped"']]]]]]]

So far so good. But when I remove the special case for Argument:, parsing fails, because ZeroOrMore doesn't behave as I expected:

#!/usr/bin/env python

#
# parsing pactl list
#

from pyparsing import *
import os
from subprocess import check_output
import sys

data = '''
Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
'''

indentStack = [1]
stmt = Forward()

identifier = Word(alphanums+"-_.")

sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)

value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=")  + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)

stmt << ( section | prop | value | prop_val )


syntax = OneOrMore(stmt)

parseTree = syntax.parseString(data)
parseTree.pprint()

This results in:

$ ./pactl.py

Module #6
    Argument:
    Name: module-alsa-card
    Usage counter: 0
    Properties:
        module.author = "Lennart Poettering"
        module.description = "ALSA Card"
        module.version = "14.0-rebootstrapped"
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 19(3,9)
Matched Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) -> [[['Argument'], ['Name']]]
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 1(2,1)
Exception raised:Expected ":", found '#'  (at char 8), (line:2, col:8)
Traceback (most recent call last):
  File "/home/alberto/projects/node/pacmd_list_json/./pactl.py", line 55, in <module>
    parseTree = syntax.parseString(partial)
  File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
    raise exc
  File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 6336, in checkUnindent
    raise ParseException(s, l, "not an unindent")
pyparsing.ParseException: Expected {{Group:({Group:(W:(ABCD...)) Suppress:("#") Group:(W:(0123...))}) indented block} | {"Properties:" indented block} | Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) | Group:({Group:(W:(ABCD...)) Suppress:("=") Group:(Combine:({{W:(ABCD...) | <SP><TAB>}}...))})}, found ':'  (at char 41), (line:4, col:13)

The setDebug output for the value grammar shows that ZeroOrMore is consuming a token from the next line: [[['Argument'], ['Name']]]

I tried LineEnd() and other tricks, but none of them worked.

Any idea how to make ZeroOrMore stop at LineEnd(), or how to handle this without special cases?

NOTE: Real output can be retrieved using:

env = os.environ.copy()
env['LANG'] = 'C'
data = check_output(
    ['pactl', 'list'], universal_newlines=True, env=env)

Solution

  • indentedBlock is not the easiest pyparsing element to work with. But there are a few things that you are doing that are getting in your way.

    To debug this, I broke down some of your more complex expressions, used setName() to give them names, and then added .setDebug(). Like this:

    identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
    

    This tells pyparsing to print a message whenever this expression is about to be tried, followed by either the tokens it matched or the exception that was raised.

    Match identifier at loc 1(2,1)
    Matched identifier -> ['Module']
    Match identifier at loc 15(3,5)
    Matched identifier -> ['Argument']
    Match identifier at loc 15(3,5)
    Matched identifier -> ['Argument']
    Match identifier at loc 23(3,13)
    Exception raised:Expected identifier, found ':'  (at char 23), (line:3, col:13)
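
As a small, hedged illustration of the naming step: the sketch below applies setName() and setDebug() to the same identifier expression and then parses two made-up strings (the inputs are assumptions, not taken from the pactl output). The name given to setName() is the label that shows up in the debug trace and after "Expected" in the error message.

```python
from pyparsing import ParseException, Word, alphanums, alphas

# setName() gives the expression a readable label; setDebug() makes pyparsing
# print a trace line for every attempt to match it.
identifier = Word(alphas, alphanums + "-_.").setName("identifier").setDebug()

# A string that starts with a letter matches as a single token.
matched = identifier.parseString("module-alsa-card").asList()

try:
    # A string that starts with ':' cannot be an identifier.
    identifier.parseString(": no label here")
    error_text = ""
except ParseException as err:
    # The label chosen with setName() is what appears after "Expected".
    error_text = str(err)

print(matched)
print(error_text)
```

Naming only the sub-expressions under suspicion keeps the trace short enough to read.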
    

    It looks like these expressions are messing up the indentedBlock matching, by processing whitespace that should be indentation space:

    Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))
    

    The " character in the Word and the whitespace lead me to believe you are trying to match quoted strings. I replaced this expression with:

    Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString))
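
A minimal sketch of what this replacement does, using two sample values (one quoted, one bare) chosen for illustration: quotedString matches the whole quoted text, inner spaces included, as one token, so no White expression is needed to bridge the words.

```python
from pyparsing import Combine, OneOrMore, Word, alphanums, alphas, quotedString

# A property value is either a bare word or a quoted string. quotedString
# consumes the full quoted text, so the space inside it needs no White().
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums + "-/.") | quotedString))

quoted = prop_val_value.parseString('"Lennart Poettering"').asList()
bare = prop_val_value.parseString("module-alsa-card").asList()

print(quoted)  # the quotes are kept as part of the token
print(bare)
```

Note that Word(alphas, ...) requires the first character to be a letter, so with this expression an unquoted value that starts with a digit would not match.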
    

    You also need to take care not to read past the end of the line, or you'll also mess up the indentedBlock indentation tracking. I added this expression for a newline at the top:

    NL = LineEnd()
    

    and then used it as the stopOn argument to OneOrMore and ZeroOrMore:

    prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
    prop_val = Group(identifier + Suppress("=")  + Group(prop_val_value)).setName("prop_val")#.setDebug()
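
To see the effect of stopOn in isolation, here is a small sketch on a made-up two-line string (the words are placeholders, not pactl output). By default pyparsing skips newlines as ordinary whitespace, so a repetition runs onto the next line unless it is told where to stop.

```python
from pyparsing import LineEnd, OneOrMore, Word, alphas

NL = LineEnd()
word = Word(alphas)
text = "alpha beta\ngamma delta"

# Without stopOn the newline is skipped like any other whitespace,
# so the repetition collects words from both lines.
unbounded = OneOrMore(word).parseString(text).asList()

# With stopOn=NL the repetition checks for the end of the line before
# each further word and stops when it finds it.
bounded = OneOrMore(word, stopOn=NL).parseString(text).asList()

print(unbounded)
print(bounded)
```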
    

    Here is the parser I ended up with:

    indentStack = [1]
    stmt = Forward()
    NL = LineEnd()
    
    identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
    
    sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums))).setName("sect_def")#.setDebug()
    inner_section = indentedBlock(stmt, indentStack)
    section = (sect_def + inner_section)
    
    #~ value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
    value_label = originalTextFor(OneOrMore(identifier)).setName("value_label")#.setDebug()
    value = Group(value_label
                  + Suppress(":")
                  + Optional(~NL + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_.') | quotedString(), stopOn=NL))))).setName("value")#.setDebug()
    prop_name = Literal("Properties:")
    prop_section = indentedBlock(stmt, indentStack)
    #~ prop_val = Group(Group(identifier) + Suppress("=")  + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
    prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
    prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
    prop = (prop_name + prop_section).setName("prop")#.setDebug()
    
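    stmt << ( section | prop | value | prop_val )

    # Same driver as in the question; data is the sample pactl output defined there.
    syntax = OneOrMore(stmt)
    parseTree = syntax.parseString(data)
    parseTree.pprint()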
    

    Which gives this:

    [[['Module'], ['6']],
     [[['Argument']],
      [['Name', ['module-alsa-card']]],
      [['Usage counter', ['0']]],
      ['Properties:',
       [[['module.author', ['"Lennart Poettering"']]],
        [['module.description', ['"ALSA Card"']]],
        [['module.version', ['"14.0-rebootstrapped"']]]]]]]
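
One detail of the value expression that is easy to miss is the Optional(~NL + ...) wrapper. The sketch below is a simplified version without indentedBlock, and the names unguarded and guarded are hypothetical. It suggests why stopOn alone is not enough for a line with an empty value: the leading whitespace, including the newline, is skipped before the repetition makes its first stopOn check, so the first word of the next line is still picked up. The ~NL lookahead runs before any whitespace is skipped and rejects the value when the line ends right after the colon.

```python
from pyparsing import (Group, LineEnd, OneOrMore, Optional, Suppress, Word,
                       ZeroOrMore, alphanums, alphas)

NL = LineEnd()
label = Word(alphas, alphanums + "-_.")
word = Word(alphanums + "-/=_.")
text = "Argument:\nName: module-alsa-card\n"

# stopOn only: the newline after "Argument:" is skipped as whitespace before
# the first check, so "Name" is taken as the value of "Argument".
unguarded = Group(label + Suppress(":") + Group(ZeroOrMore(word, stopOn=NL)))
unguarded_result = unguarded.parseString(text).asList()

# ~NL guard: the negative lookahead sees the newline directly after the colon,
# so the optional value is skipped and "Name" starts a new entry.
guarded = Group(label + Suppress(":")
                + Optional(~NL + Group(ZeroOrMore(word, stopOn=NL))))
guarded_result = OneOrMore(guarded).parseString(text).asList()

print(unguarded_result)
print(guarded_result)
```

The unguarded result has the same shape as the [[['Argument'], ['Name']]] match reported by setDebug in the question.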