Tags: python, parsing, markup, pyparsing, creole

Pyparsing - Rule Ambiguity


I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because these two constructs conflict when I try to parse them:

Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}

The way I'm parsing the image link is as follows (note that this converts perfectly fine):

from pyparsing import QuotedString, ParseFatalException

def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s,l,"invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)

image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])

However, when I try to run the following test case:

text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")

I get the following stack trace:

Traceback (most recent call last):
  File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
    print text.transformString("{{{ //ignore formatting// }}}")
  File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
    raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting//  (at char 0), (line:1, col:1)

From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?


Solution

    The issue is with this expression:

    text = italic | bold | hr | newline | image | n
    

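    Pyparsing works strictly left-to-right, with no implicit lookahead. Alternatives joined with the '|' operator form a pyparsing MatchFirst expression, which tries them in order and accepts the first one that matches, even if a later alternative would match more of the input. Here image matches on the leading {{ before n is ever tried, and its parse action then raises the fatal exception.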

    You can switch to "longest match" evaluation by using the '^' operator instead, which builds an Or expression:

    text = italic ^ bold ^ hr ^ newline ^ image ^ n
    

    This carries a performance penalty: every alternative is tried at each position, even when an earlier one has already matched and none of the others could match anything longer.

    An easier solution is to reorder the alternatives so that n is tested before image:

    text = italic | bold | hr | newline | n | image
    

    Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
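As a rough check of the reordering fix, here is a minimal sketch that keeps only the two conflicting rules from the question. The italic, bold, hr and newline rules are left out because their definitions are not shown, and the variable names broken, fixed, image_first_failed, nowiki_html and image_html are made up for this illustration.

```python
from pyparsing import ParseFatalException, QuotedString


def parse_image(s, l, t):
    # Expect "link|title"; anything else is reported as a fatal error.
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s, l, "invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)


image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda t: t[0])

# Original order: image sees the leading "{{" first and its parse action fails.
broken = image | n
image_first_failed = False
try:
    broken.transformString("{{{ //ignore formatting// }}}")
except ParseFatalException as err:
    image_first_failed = True
    print("image | n:", err)

# Reordered: the more specific "{{{" rule is tried before the "{{" rule.
fixed = n | image
nowiki_html = fixed.transformString("{{{ //ignore formatting// }}}")
image_html = fixed.transformString("{{image.jpg|title}}")
print(repr(nowiki_html))
print(image_html)
```

With n listed first, the {{{...}}} input should come back as the bare inner text, while {{image.jpg|title}} still falls through to image, because n cannot match without a third opening brace.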

    The same problem often crops up when people define numeric terms and accidentally write something like:

    integer = Word(nums)
    realnumber = Combine(Word(nums) + '.' + Word(nums))
    number = integer | realnumber
    

    In this case, number will never match a realnumber, since the leading whole-number part is always parsed as an integer first. The fix, as in your case, is either to use the '^' operator or simply to reorder:

    number = realnumber | integer
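The three variants of the number rule can be compared side by side. This is only a small sketch: the names first_match, reordered and longest are invented here, and parseString is called without parseAll so that a partial match is visible instead of raising an error.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
realnumber = Combine(Word(nums) + "." + Word(nums))

# MatchFirst with the general alternative first: stops after the whole-number part.
first_match = integer | realnumber
# MatchFirst with the more specific alternative first.
reordered = realnumber | integer
# Or: every alternative is tried and the longest match is kept.
longest = integer ^ realnumber

for label, expr in [("integer | realnumber", first_match),
                    ("realnumber | integer", reordered),
                    ("integer ^ realnumber", longest)]:
    print(label, expr.parseString("3.14").asList())
```

Only the first variant should stop at '3'. The other two should return '3.14' as a single token, and the reordered version still accepts a plain integer such as "42", because realnumber fails on the missing '.' and integer is tried next.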