I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because there's a bit of conflict trying to parse these two constructs:
Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}
The way I'm parsing the image link is as follows (note that this converts perfectly fine):
def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s, l, "invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)
image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)
Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:
n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])
However, when I try to run the following test case:
text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")
I get the following stack trace:
Traceback (most recent call last):
  File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
    print text.transformString("{{{ //ignore formatting// }}}")
  File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
    raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting// (at char 0), (line:1, col:1)
From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?
The issue is with this expression:
text = italic | bold | hr | newline | image | n
Pyparsing works strictly left-to-right and does no implicit lookahead. The '|' operator builds a pyparsing MatchFirst expression, which takes the first alternative that matches, even if a later alternative would match more of the input.
You can change the evaluation to use "longest match" by using the '^' operator instead:
text = italic ^ bold ^ hr ^ newline ^ image ^ n
This carries a performance penalty, because every alternative is tested at each position, even when none of the later ones could produce a better match.
An easier solution is to just reorder the expressions in your list of alternatives: test for n before image:
text = italic | bold | hr | newline | n | image
Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
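To check the reordering in isolation, here is a minimal sketch that uses only the two conflicting expressions from the question. The other alternatives (italic, bold, hr, newline) are left out because their definitions are not shown, and the variable names nowiki_result and image_result are made up for this example.

```python
from pyparsing import ParseFatalException, QuotedString


def parse_image(s, l, t):
    # Same parse action as in the question: split "link|title" into an <img> tag.
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s, l, "invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)


image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])

# n is listed before image, so the three-brace opener is tried first.
text = n | image

# Braces are stripped and the // markers are left untouched.
nowiki_result = text.transformString("{{{ //ignore formatting// }}}")
print(nowiki_result)

# A two-brace construct fails n and falls through to image.
image_result = text.transformString("{{image.jpg|title}}")
print(image_result)  # <img src="image.jpg" alt="title" />
```

With the alternatives in the original order (image | n), the first call raises the ParseFatalException shown in the traceback, because image matches the leading {{ and its parse action finds no "|" in the content.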
This often crops up when people define numeric terms, and accidentally define something like:
integer = Word(nums)
realnumber = Combine(Word(nums) + '.' + Word(nums))
number = integer | realnumber
In this case, number will never match a realnumber, since the leading whole number part will be parsed as an integer. The fix, as in your case, is to either use the '^' operator, or just reorder:
number = realnumber | integer
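The three variants can be compared side by side. This is a small sketch rather than anything definitive: the names first_match_wrong, first_match_fixed and longest_match are invented here, and parseString is called without parseAll, so unparsed trailing input is silently ignored.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
realnumber = Combine(Word(nums) + "." + Word(nums))

# MatchFirst with the shorter alternative first: integer wins on "3.14".
first_match_wrong = integer | realnumber

# MatchFirst with the more specific alternative first.
first_match_fixed = realnumber | integer

# Or ('^') tries both alternatives and keeps the longest match.
longest_match = integer ^ realnumber

wrong = first_match_wrong.parseString("3.14")[0]
fixed = first_match_fixed.parseString("3.14")[0]
longest = longest_match.parseString("3.14")[0]
plain_int = first_match_fixed.parseString("42")[0]

print(wrong)      # 3
print(fixed)      # 3.14
print(longest)    # 3.14
print(plain_int)  # 42
```

Reordering keeps the cheaper first-match behaviour, so it is usually preferable whenever one alternative is a prefix of another.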