I'm trying to parse Csound opcode lines with pyparsing (to create a custom auto-formatter), and I'm stuck trying to define the following pattern:
optional comma list + keyword + optional comma list
The possible variations of a Csound line of code:
prints "int: %d%n", 45
a1 oscil 0.5, 440, -1
aL, aR stereo a1
outs aL, aR
And their syntaxes:
opcode string, param
output opcode param, param, param
output, output opcode param
opcode param, param
So far I have come up with this:
import pyparsing as pp
samples = [
"prints \"int: %d%n\", 45",
"a1 oscil 0.5, 440",
"aL, aR stereo a1",
"outs aL, aR"
]
var = pp.Word(pp.alphanums)
outputs = pp.Optional(pp.delimitedList(var))
opcode = pp.Word(pp.alphanums)
param = var | pp.dblQuotedString
params = pp.Optional(pp.delimitedList(param))
csound_line = outputs("outputs") \
+ opcode("opcode") \
+ params("params")
parsed = csound_line.parseString(samples[1])
print(parsed.dump())
parsed = csound_line.parseString(samples[2])
print(parsed.dump())
parsed = csound_line.parseString(samples[3])
print(parsed.dump())
parsed = csound_line.parseString(samples[0])
print(parsed.dump())
But it gives me this:
['a1', 'oscil', '0']
- opcode: 'oscil'
- outputs: ['a1']
- params: ['0']
['aL', 'aR', 'stereo', 'a1']
- opcode: 'stereo'
- outputs: ['aL', 'aR']
- params: ['a1']
['outs', 'aL']
- opcode: 'aL'
- outputs: ['outs']
Traceback (most recent call last):
File "/home/oliver/projects/personal/local/csoundformat/./test.py", line 33, in <module>
parsed = csound_line.parseString(samples[0])
File "/usr/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
raise exc
File "/usr/lib/python3.9/site-packages/pyparsing.py", line 3250, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected W:(ABCD...), found '"' (at char 7), (line:1, col:8)
Parsing `samples[2]` works correctly, but `samples[1]` and `samples[3]` are missing their trailing parameters, and `samples[0]` causes a complaint about double quotes, despite my using `dblQuotedString`.
I just can't seem to nail down the right combination of `Optional`/`ZeroOrMore` and `delimitedList`. Any assistance would be appreciated.
Thanks!
A couple of notes before answering the core question:
I found it quite confusing that you list the input samples in a different order from the execution runs. I guess I understand why you ran them in that order (although a `try` block would have solved the problem), but it would have been a lot more reader-friendly to just reorder the input list. Just sayin', for future reference.
The `oscil` line cuts off early because your `param` only recognises alphanumerics and quoted strings. Unsigned integers are made up of `alphanums`, but since `.` is not an alphanum, floating point numbers like `0.5` don't match: `pp.Word(pp.alphanums)` stops after the `0`.
You'd run into a similar problem with `-1`, with the difference that there is no previous digit which could be matched, so the parameter would not match at all.
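As a rough illustration of both points, here is a minimal sketch of a signed number expression. The `number` regex is my own assumption about what counts as a numeric literal (optional sign, digits, optional fractional part); it does not try to cover everything Csound accepts, such as exponents or a leading `.`.

```python
import pyparsing as pp

word = pp.Word(pp.alphanums)

# Word(alphanums) consumes the leading digits and stops at the '.'.
print(word.parseString("0.5").asList())

# With '-1' there is no leading alphanumeric at all, so the match fails.
try:
    word.parseString("-1")
    word_accepts_sign = True
except pp.ParseException:
    word_accepts_sign = False
print(word_accepts_sign)

# Assumed numeric syntax: optional sign, digits, optional fraction.
number = pp.Regex(r"[+-]?\d+(?:\.\d+)?")
for text in ["0.5", "440", "-1"]:
    print(number.parseString(text, parseAll=True).asList())
```

If you use something like this, `number` has to be tried before `var` in `param`, otherwise `var` will grab the integer part first.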
The fundamental problem, though, is that the grammar is ambiguous unless you have a way of distinguishing an output variable from an opcode. Without that, `a b` can be parsed either as `output(a) opcode(b) param()` or as `output() opcode(a) param(b)`.
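If you did have a list of opcode names, the ambiguity would go away. The following is only a sketch under that assumption: `KNOWN_OPCODES` is a hypothetical table built from the samples in the question, not a real Csound opcode list, and `number` is an assumed numeric syntax.

```python
import pyparsing as pp

# Assumption: a hypothetical opcode table taken from the question's samples.
KNOWN_OPCODES = ["prints", "oscil", "stereo", "outs"]

# Keyword only matches whole words, so 'outs' will not match inside 'outsx'.
opcode = pp.MatchFirst([pp.Keyword(name) for name in KNOWN_OPCODES])

# A variable is any identifier that is not a known opcode.
var = ~opcode + pp.Word(pp.alphas, pp.alphanums)

# Assumed numeric syntax: optional sign, digits, optional fraction.
number = pp.Regex(r"[+-]?\d+(?:\.\d+)?")
param = number | pp.dblQuotedString | var

csound_line = (
    pp.Optional(pp.delimitedList(var))("outputs")
    + opcode("opcode")
    + pp.Optional(pp.delimitedList(param))("params")
)

samples = [
    'prints "int: %d%n", 45',
    "a1 oscil 0.5, 440, -1",
    "aL, aR stereo a1",
    "outs aL, aR",
]
for sample in samples:
    print(csound_line.parseString(sample, parseAll=True).dump())
```

Because `var` refuses to match an opcode name, the optional output list is empty whenever the line starts with an opcode, and the simple `outputs + opcode + params` shape works without further case analysis.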
Pyparsing's `Optional` is greedy, so the `pp.Optional(...)` wrapped around the output list will parse the first `var` token as a one-element `outputs`, and pyparsing will not go back and retry with the outputs left empty. That means that it would parse `a b` as `outputs(a) opcode(b) params()`, which resolves the ambiguity but doesn't always produce the correct parse. It will do that even in cases where it is obviously wrong (to a human observer who can see the entire command), such as the commands `op` or `op "foo"`. The fact that `op` will be parsed as an output, not an opcode, means that both of those will produce syntax errors. (That's the problem with the `prints` command.)
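You can see this behaviour directly with the grammar from the question. This is just a small demonstration; `try_parse` is a helper I made up for the purpose.

```python
import pyparsing as pp

# The grammar exactly as defined in the question.
var = pp.Word(pp.alphanums)
outputs = pp.Optional(pp.delimitedList(var))
opcode = pp.Word(pp.alphanums)
param = var | pp.dblQuotedString
params = pp.Optional(pp.delimitedList(param))
csound_line = outputs("outputs") + opcode("opcode") + params("params")


def try_parse(text):
    """Return the token list, or None if the grammar rejects the text."""
    try:
        return csound_line.parseString(text).asList()
    except pp.ParseException:
        return None


# 'a b' parses (with 'a' taken as an output); the other two are rejected
# because 'op' is consumed as an output and no opcode is left to match.
for text in ["a b", "op", 'op "foo"']:
    print(repr(text), "->", try_parse(text))
```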
So to parse a line, it's necessary to distinguish between the following cases (I use `[...]` to indicate optional):
a , ... => a is an output, ... is (more) outputs followed by opcode [params]
a b , ... => a is an opcode, b is a param, ... is more params
a b c ... => a is an output, b is an opcode, ... is [, params]
a b => ambiguous. Either output opcode or opcode param.
a => a is an opcode (and nothing follows)
That's only an approximation, because a param could be a quoted string, a number, or (I suppose) an expression. I think the grammar below will correctly handle that case, but you'll need to expand the definition of `param` in order to try it.
It would probably be more efficient to combine the third and fourth cases by using an `opt_params` instead of `params`; you can then detect the ambiguous case by the absence of params. But I left it like this to make the ambiguity clearer.
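For what it's worth, the combined version might look roughly like the following sketch. It uses the same definitions as the grammar in this answer but drops the separate `ambiguous` alternative; `is_ambiguous` is a helper name I made up, and it relies on `params` not being set when the optional parameter list is empty.

```python
import pyparsing as pp

var = pp.Word(pp.alphanums)
outputs = pp.delimitedList(var)
opcode = pp.Word(pp.alphanums)
param = pp.Combine(pp.Word(pp.nums) + '.' + pp.Word(pp.nums)) \
    | var \
    | pp.dblQuotedString
params = pp.delimitedList(param)
opt_params = pp.Optional(params)
comma = pp.Suppress(',')

csound_line = (
    (var + comma + outputs)("outputs") + opcode("opcode") + opt_params("params")
) | (
    opcode("opcode") + (param + comma + params)("params")
) | (
    # Third and fourth cases merged: the params are now optional.
    pp.And((var,))("outputs") + opcode("opcode") + opt_params("params")
) | (
    opcode("opcode") + opt_params("params")
)


def is_ambiguous(parsed):
    """Two bare words and no params: 'output opcode' or 'opcode param'."""
    return len(parsed) == 2 and "params" not in parsed


for sample in ["opcode param", "output opcode param", 'op "foo"', "opcode"]:
    parsed = csound_line.parseString(sample, parseAll=True)
    print(sample, "->", parsed.asList(), "ambiguous:", is_ambiguous(parsed))
```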
var = pp.Word(pp.alphanums)
outputs = pp.delimitedList(var)
opt_outputs = pp.Optional(outputs)
opcode = pp.Word(pp.alphanums)
# Note: Probably need to add expressions. I added a very simple
# floating point syntax, but it doesn't handle signs. (It has to go first
# in order to avoid 'var' matching the initial integer.)
param = pp.Combine(pp.Word(pp.nums) + '.' + pp.Word(pp.nums)) \
| var \
| pp.dblQuotedString
params = pp.delimitedList(param)
opt_params = pp.Optional(params)
comma = pp.Suppress(',')
csound_line = ( (var + comma + outputs)("outputs")
+ opcode("opcode")
+ opt_params("params")
) | (
opcode("opcode") + (param + comma + params)("params")
) | (
pp.And((var,))("outputs") + opcode("opcode") + params("params")
) | (
(var + var)("ambiguous")
) | (
opcode("opcode") + opt_params("params")
)
(The use of `pp.And((var,))` is to put the single output token into a list, for consistency with the other `outputs` parses. There might well be a better way to do that.)
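One candidate is `pp.Group`, sketched below. I haven't used it in the grammar above because it also nests the token inside the overall result list, which the other `outputs` alternatives do not do, so treat it as an option to evaluate and not as a drop-in replacement.

```python
import pyparsing as pp

var = pp.Word(pp.alphanums)

# Group wraps the matched tokens in their own sub-list.
grouped_output = pp.Group(var)("outputs")

result = grouped_output.parseString("a1")
print(result["outputs"].asList())  # the named result is a one-element list
print(result.asList())             # but the token is also nested in the result
```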
Note that pyparsing's `parseString` does not insist on parsing to the end of the input, which is why some of your test cases failed silently, returning a truncated result instead of raising an error. I think it's better for failures to be explicit, so I added `parseAll=True` to the call. I also put a `try` block around it, to make the test a little easier to write:
samples = """
prints "int: %d%n", 45
a1 oscil 0.5, 440
aL, aR stereo a1
outs aL, aR
output opcode param
opcode param
output opcode
opcode
"""
for sample in samples.splitlines()[1:]:
    print(sample)
    try:
        print(csound_line.parseString(sample, parseAll=True).dump())
    except pp.ParseException as e:
        print("Parse failed:")
        print(e)
    print('-----------------------')
Here's the test output:
prints "int: %d%n", 45
['prints', '"int: %d%n"', '45']
- opcode: 'prints'
- params: ['"int: %d%n"', '45']
-----------------------
a1 oscil 0.5, 440
['a1', 'oscil', '0.5', '440']
- opcode: 'oscil'
- outputs: ['a1']
- params: ['0.5', '440']
-----------------------
aL, aR stereo a1
['aL', 'aR', 'stereo', 'a1']
- opcode: 'stereo'
- outputs: ['aL', 'aR']
- params: ['a1']
-----------------------
outs aL, aR
['outs', 'aL', 'aR']
- opcode: 'outs'
- params: ['aL', 'aR']
-----------------------
output opcode param
['output', 'opcode', 'param']
- opcode: 'opcode'
- outputs: ['output']
- params: ['param']
-----------------------
opcode param
['opcode', 'param']
- ambiguous: ['opcode', 'param']
-----------------------
output opcode
['output', 'opcode']
- ambiguous: ['output', 'opcode']
-----------------------
opcode
['opcode']
- opcode: 'opcode'
-----------------------