
Python Pyparsing: Identify Pattern of Optional List + Keyword + Optional List


I'm trying to parse Csound opcode lines with pyparsing (to create a custom auto-formatter) and I'm stuck trying to define the following pattern:

optional comma list + keyword + optional comma list

The possible variations of a Csound line of code:

        prints  "int: %d%n", 45
a1      oscil   0.5, 440, -1
aL, aR  stereo  a1
        outs    aL, aR

And their syntaxes:

                opcode  string, param
output          opcode  param, param, param
output, output  opcode  param
                opcode  param, param

So far I have come up with this:

import pyparsing as pp

samples = [
    "prints \"int: %d%n\", 45",
    "a1 oscil 0.5, 440",
    "aL, aR stereo a1",
    "outs aL, aR"
]

var = pp.Word(pp.alphanums)

outputs = pp.Optional(pp.delimitedList(var))

opcode = pp.Word(pp.alphanums)

param = var | pp.dblQuotedString
params = pp.Optional(pp.delimitedList(param))

csound_line = outputs("outputs") \
    + opcode("opcode") \
    + params("params")

parsed = csound_line.parseString(samples[1])
print(parsed.dump())
parsed = csound_line.parseString(samples[2])
print(parsed.dump())
parsed = csound_line.parseString(samples[3])
print(parsed.dump())
parsed = csound_line.parseString(samples[0])
print(parsed.dump())

But it gives me this:

['a1', 'oscil', '0']
- opcode: 'oscil'
- outputs: ['a1']
- params: ['0']
['aL', 'aR', 'stereo', 'a1']
- opcode: 'stereo'
- outputs: ['aL', 'aR']
- params: ['a1']
['outs', 'aL']
- opcode: 'aL'
- outputs: ['outs']
Traceback (most recent call last):
  File "/home/oliver/projects/personal/local/csoundformat/./test.py", line 33, in <module>
    parsed = csound_line.parseString(samples[0])
  File "/usr/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
    raise exc
  File "/usr/lib/python3.9/site-packages/pyparsing.py", line 3250, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected W:(ABCD...), found '"'  (at char 7), (line:1, col:8)

Parsing line 2 works correctly, but lines 1 and 3 are missing end parameters and line 0 causes a complaint about double quotes, despite my using dblQuotedString.

I just can't seem to nail down the right combination of Optional/ZeroOrMore and delimitedList. Any assistance would be appreciated.

Thanks!


Solution

  • A couple of notes before answering the core question:

    1. I found it confusing that you list the input samples in a different order from the execution runs. I think I understand why you ran them in that order (although a try block would have solved the problem), but it would have been a lot more reader-friendly to just reorder the input list. Just sayin', for future reference.

    2. The oscil line cuts off early because your param only recognises alphanumerics and quoted strings. Unsigned integers are made up of alphanums, but since . is not an alphanum, floating point numbers like 0.5 don't match. pp.Word(pp.alphanums) stops after the 0.

      You'd run into a similar problem with -1, with the difference that there is no previous digit which could be matched.
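
    As a quick check of both points, here is a minimal sketch. The `number` expression is my own illustration (an optional sign, digits, and an optional fraction); it is an assumption about what a numeric param looks like, not Csound's full number syntax.

```python
import pyparsing as pp

# The question's definition: a run of letters and digits only.
var = pp.Word(pp.alphanums)

# '.' is not in alphanums, so matching "0.5" stops after the leading "0".
truncated = var.parseString("0.5").asList()

# "-1" cannot match at all, because '-' is not in alphanums either.
try:
    var.parseString("-1")
    minus_one_matches = True
except pp.ParseException:
    minus_one_matches = False

# Hypothetical replacement: optional sign, digits, optional fraction.
# It has to come before 'var' so that 'var' does not grab the leading digits.
number = pp.Regex(r"[+-]?\d+(\.\d+)?")
param = number | var | pp.dblQuotedString

parsed = pp.delimitedList(param).parseString("0.5, 440, -1", parseAll=True).asList()
print(truncated, minus_one_matches, parsed)
```

    With a param like this, the full `a1 oscil 0.5, 440, -1` line from the question no longer loses its trailing values at the param level.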


    The fundamental problem, though, is that the grammar is ambiguous unless you have a way of distinguishing an output variable from an opcode. Without that, a b can be parsed either as output(a) opcode(b) param() or output() opcode(a) param(b).

    Pyparsing's Optional is greedy, so pp.Optional(outputs) will parse the first var token as a one-element outputs. That means that it would parse a b as outputs(a) opcode(b) params(), which resolves the ambiguity but doesn't always produce the correct parse. It will do that even in cases where it is obviously wrong (to a human observer who can see the entire command), such as the commands op or op "foo". The fact that op will be parsed as an output, not an opcode, means that both of those will produce syntax errors. (That's the problem with the prints command.)
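
    The greedy behaviour is easy to reproduce with the grammar from the question. This is only a sketch; `try_parse` is a helper name I made up for the demonstration.

```python
import pyparsing as pp

# The grammar exactly as defined in the question.
var = pp.Word(pp.alphanums)
outputs = pp.Optional(pp.delimitedList(var))
opcode = pp.Word(pp.alphanums)
param = var | pp.dblQuotedString
params = pp.Optional(pp.delimitedList(param))
line = outputs("outputs") + opcode("opcode") + params("params")

def try_parse(text):
    """Return the token list, or None if the grammar rejects the text."""
    try:
        return line.parseString(text).asList()
    except pp.ParseException:
        return None

two_words = try_parse("a b")          # 'a' is taken as an output, 'b' as the opcode
lone_opcode = try_parse("op")         # 'op' is consumed as an output; no opcode is left
with_string = try_parse('op "foo"')   # same: '"foo"' cannot match the opcode
print(two_words, lone_opcode, with_string)
```

    Pyparsing does not go back and retry the Optional as empty once a later element fails, which is why the last two inputs are rejected rather than reparsed with op as the opcode.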

    So to parse a line, it's necessary to distinguish between the following cases (I use [...] to indicate optional):

    a , ...    => a is an output, ... is (more) outputs followed by opcode [params]
    a b , ...  => a is an opcode, b is a param, ... is more params
    a b c ...  => a is an output, b is an opcode, c is a param, ... is [, params]
    a b        => ambiguous. Either output opcode or opcode param.
    a          => a is an opcode (and nothing follows)
    

    That's only an approximation, because a param could be a quoted string, a number, or (I suppose) an expression. I think the grammar below will correctly handle that case, but you'll need to expand the definition of param in order to try it.

    It would probably be more efficient to combine the third and fourth cases by using an opt_params instead of params; you can then detect the ambiguous case by the absence of params. But I left it like this to make the ambiguity clearer.

    import pyparsing as pp
    
    var = pp.Word(pp.alphanums)
    
    outputs = pp.delimitedList(var)
    opt_outputs = pp.Optional(outputs)
    
    opcode = pp.Word(pp.alphanums)
    
    # Note: Probably need to add expressions. I added a very simple
    # floating point syntax, but it doesn't handle signs. (It has to go first
    # in order to avoid 'var' matching the initial integer.)
    param = pp.Combine(pp.Word(pp.nums) + '.' + pp.Word(pp.nums)) \
            | var \
            | pp.dblQuotedString
    
    params = pp.delimitedList(param)
    opt_params = pp.Optional(params)
    
    comma = pp.Suppress(',')
    
    csound_line = ( (var + comma + outputs)("outputs")
                    + opcode("opcode")
                    + opt_params("params")
                  ) | (
                    opcode("opcode") + (param + comma + params)("params")
                  ) | (
                    pp.And((var,))("outputs") + opcode("opcode") + params("params")
                  ) | (
                    (var + var)("ambiguous")
                  ) | (
                    opcode("opcode") + opt_params("params")
                  )
    

    (The use of pp.And((var,)) is to put the single output token into a list, for consistency with the other outputs parses. There might well be a better way to do that.)
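
    For comparison, here is a sketch of the merged form mentioned earlier: the single-output alternative uses opt_params, the separate "ambiguous" alternative is dropped, and the ambiguous case is detected afterwards. The names `merged_line`, `is_ambiguous` and `classify` are mine, and the two-token test is just one way to express "one output and no params".

```python
import pyparsing as pp

var = pp.Word(pp.alphanums)
outputs = pp.delimitedList(var)
opcode = pp.Word(pp.alphanums)
param = (pp.Combine(pp.Word(pp.nums) + '.' + pp.Word(pp.nums))
         | var
         | pp.dblQuotedString)
params = pp.delimitedList(param)
opt_params = pp.Optional(params)
comma = pp.Suppress(',')

merged_line = (
    # a , ...   : two or more outputs, then opcode and optional params
    (var + comma + outputs)("outputs") + opcode("opcode") + opt_params("params")
) | (
    # a b , ... : opcode followed by two or more params
    opcode("opcode") + (param + comma + params)("params")
) | (
    # a b [c ...] : one output, an opcode, optional params (merged cases)
    pp.And((var,))("outputs") + opcode("opcode") + opt_params("params")
) | (
    # a [param] : opcode, optionally followed by a param that is not a var
    opcode("opcode") + opt_params("params")
)

def is_ambiguous(result):
    # Two bare words and no params: could be output+opcode or opcode+param.
    return len(result) == 2 and "params" not in result

def classify(text):
    result = merged_line.parseString(text, parseAll=True)
    return "ambiguous" if is_ambiguous(result) else "ok"

for text in ["output opcode", "opcode param", "a1 oscil 0.5, 440", "opcode"]:
    print(text, "->", classify(text))
```

    The trade-off is that an ambiguous line now comes back labelled as outputs plus opcode, so the caller has to remember to run the check.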

    Note that pyparsing's parseString does not insist on parsing to the end of the input, which is why some of your test cases failed silently. I think it's better for failures to be explicit, so I added parseAll=True to the call. I also put a try block around it, to make the test a little easier to write:

    samples = """
        prints "int: %d%n", 45
        a1 oscil 0.5, 440
        aL, aR stereo a1
        outs aL, aR
        output opcode param
        opcode param
        output opcode
        opcode
    """
    for sample in samples.splitlines()[1:]:
        print(sample)
        try:
            print(csound_line.parseString(sample, parseAll=True).dump())
        except pp.ParseException as e:
            print("Parse failed:")
            print(e)
        print('-----------------------')
    

    Here's the test output:

        prints "int: %d%n", 45
    ['prints', '"int: %d%n"', '45']
    - opcode: 'prints'
    - params: ['"int: %d%n"', '45']
    -----------------------
        a1 oscil 0.5, 440
    ['a1', 'oscil', '0.5', '440']
    - opcode: 'oscil'
    - outputs: ['a1']
    - params: ['0.5', '440']
    -----------------------
        aL, aR stereo a1
    ['aL', 'aR', 'stereo', 'a1']
    - opcode: 'stereo'
    - outputs: ['aL', 'aR']
    - params: ['a1']
    -----------------------
        outs aL, aR
    ['outs', 'aL', 'aR']
    - opcode: 'outs'
    - params: ['aL', 'aR']
    -----------------------
        output opcode param
    ['output', 'opcode', 'param']
    - opcode: 'opcode'
    - outputs: ['output']
    - params: ['param']
    -----------------------
        opcode param
    ['opcode', 'param']
    - ambiguous: ['opcode', 'param']
    -----------------------
        output opcode
    ['output', 'opcode']
    - ambiguous: ['output', 'opcode']
    -----------------------
        opcode
    ['opcode']
    - opcode: 'opcode'
    -----------------------
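
    The effect of parseAll can also be seen in isolation by running the question's original grammar on the oscil line. This is a small sketch for illustration, not part of the test run above.

```python
import pyparsing as pp

# The question's original grammar, which cannot match "0.5" as one token.
var = pp.Word(pp.alphanums)
line = (pp.Optional(pp.delimitedList(var))("outputs")
        + pp.Word(pp.alphanums)("opcode")
        + pp.Optional(pp.delimitedList(var | pp.dblQuotedString))("params"))

text = "a1 oscil 0.5, 440"

# Default: parsing stops quietly at the first character it cannot match.
partial = line.parseString(text).asList()

# parseAll=True: leftover input is reported as an error instead.
try:
    line.parseString(text, parseAll=True)
    error_message = None
except pp.ParseException as exc:
    error_message = str(exc)

print(partial)
print(error_message)
```

    The first call returns the truncated token list seen in the question; the second raises a ParseException pointing at the unparsed remainder.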