python parsing context-free-grammar pyparsing context-sensitive-grammar

How to match only the last occurrence of a keyword with an expression grammar


I want to write an expression grammar that matches strings like these:

words at the start ONE|ANOTHER wordAtTheEnd

---------^-------- -----^----- -----^------
    A: alphas       B: choice   C: alphas

The issue, however, is that part A can itself contain the keywords "ONE" or "ANOTHER" from part B, so only the last occurrence of a choice keyword should match part B. Here is an example. The string

ZERO ONE or TWO are numbers ANOTHER letsendhere

should be parsed into the fields

A: "ZERO ONE or TWO are numbers"
B: "ANOTHER"
C: "letsendhere"

With pyparsing I tried the "stopOn" argument of the OneOrMore expression:

import pyparsing as pp

choice = pp.Or([pp.Keyword("ONE"), pp.Keyword("ANOTHER")])('B')
start = pp.OneOrMore(pp.Word(pp.alphas), stopOn=choice)('A')
end = pp.Word(pp.alphas)('C')
expr = (start + choice) + end

But this does not work. For the sample string I get the ParseException:

Expected end of text (at char 12), (line:1, col:13)
"ZERO ONE or >!<TWO are numbers ANOTHER text"

This makes sense, because stopOn stops at the first occurrence of choice, not the last. How can I write a grammar that stops at the last occurrence instead? Maybe I need to resort to a context-sensitive grammar?


Solution

  • Sometimes you have to try to "be the parser". What is it about the "last occurrence of X" that distinguishes it from the other X's? One way to say it is "an X that is not followed by any more X's". With pyparsing, you could write a helper function like this:

    from pyparsing import FollowedBy, OneOrMore, SkipTo, Word, alphas, nums

    def last_occurrence_of(expr):
        return expr + ~FollowedBy(SkipTo(expr))
    

    Here it is in use as a stopOn argument to OneOrMore:

    integer = Word(nums)
    word = Word(alphas)
    list_of_words_and_ints = OneOrMore(integer | word, stopOn=last_occurrence_of(integer)) + integer
    
    print(list_of_words_and_ints.parseString("sldkfj 123 sdlkjff 123 lklj lkj 2344 234 lkj lkjj"))
    

    prints:

    ['sldkfj', '123', 'sdlkjff', '123', 'lklj', 'lkj', '2344', '234']
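
    The same helper can be applied to the grammar from the question. What follows is only a sketch, not a tested drop-in: it assumes the choice keywords are "ONE" and "ANOTHER", and it wraps part A in a Group (an addition not in the original attempt) so that the words of A come back as one list that can be joined into a string.

```python
import pyparsing as pp


def last_occurrence_of(expr):
    # Match expr only when no further expr can be found later in the input.
    return expr + ~pp.FollowedBy(pp.SkipTo(expr))


# Assumed keyword set, taken from the question text.
keyword = pp.Keyword("ONE") | pp.Keyword("ANOTHER")
word = pp.Word(pp.alphas)

# Part A: words up to, but not including, the last keyword.
start = pp.Group(pp.OneOrMore(word, stopOn=last_occurrence_of(keyword)))("A")
choice = keyword("B")   # Part B: the last keyword itself
end = word("C")         # Part C: the single trailing word
expr = start + choice + end

result = expr.parseString(
    "ZERO ONE or TWO are numbers ANOTHER letsendhere", parseAll=True
)
part_a = " ".join(result["A"])
print(part_a, "|", result["B"], "|", result["C"])
```

    Because the earlier "ONE" is still followed by another keyword further along, the lookahead rejects it as a stopping point and it is consumed as an ordinary word of part A.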