
Why are keywords not parsed first and omitted from free text matching?


I thought I understood pyparsing's logic, but cannot figure out why the bottom example is failing.

I'm trying to parse free-text comments in which a product, or a set of products, can be mentioned at either the beginning or the end of the comment. Product names can also be omitted from the comment entirely.

The output should be a list of the mentioned products and the description regarding them.

Below are some test cases. The parser identifies everything as 'description' instead of first picking up the products (isn't that what the negative lookahead is supposed to do?)

What's wrong in my understanding?

import pyparsing as pp

products_list = ['aaa', 'bbb', 'ccc']
products = pp.OneOrMore(' '.join(products_list))

word = ~products + pp.Word(pp.alphas)
description = pp.OneOrMore(word)

comment_expr = (pp.Optional(products("product1")) + description("description") + pp.Optional(products("product2")))

matches = comment_expr.scanString("""\
                                aaa is a good product
                                I prefer aaa
                                No comment
                                aaa bbb are both good products""")

for match in matches:
    print(match)

The expected results would be:

product1: aaa, description: is a good product
product2: aaa, description: I prefer
description: No comment
product1: [aaa, bbb], description: are both good products

Solution

  • Pyparsing's shortcut equivalence between strings and Literals is intended as a convenience, but sometimes it produces unexpected and unwanted results. In these lines:

    products_list = ['aaa', 'bbb', 'ccc']
    products = pp.OneOrMore(' '.join(products_list))
    

    I'm pretty sure you wanted products to match on any one of the products. But instead, OneOrMore gets passed this as its argument:

    ' '.join(products_list)
    

    This is purely a string expression, resulting in the string "aaa bbb ccc". Passing this to OneOrMore, you are saying that products is one or more instances of the string "aaa bbb ccc".

    To get the negative lookahead to work, you need to change products so that it matches any single product name:

    products = pp.oneOf(products_list)
    

    or even better, since Keyword only matches whole words (so 'aaa' will not match at the start of a longer word):

    products = pp.MatchFirst(pp.Keyword(p) for p in products_list)
    

    Then your negative lookahead will work better.
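
For reference, here is a minimal sketch of how the whole parser might look once products is fixed. A few things in it are my own assumptions rather than part of the answer above: the parse_comment helper is hypothetical, the products are wrapped in Group so that each results name holds the full list of matched products, and the description words are joined back into one string with a parse action. It also parses one comment per call instead of running scanString over the whole block, because pyparsing skips newlines as whitespace by default, so a description would otherwise keep consuming words from the following line.

```python
import pyparsing as pp

products_list = ['aaa', 'bbb', 'ccc']

# A single product name, matched only as a whole word.
product = pp.MatchFirst([pp.Keyword(p) for p in products_list])
# One or more product names in a row; Group keeps them together as one list.
products = pp.Group(pp.OneOrMore(product))

# A description word is any alphabetic word that is not a product name.
word = ~product + pp.Word(pp.alphas)
# Join the matched words back into a single string.
description = pp.OneOrMore(word).setParseAction(lambda t: ' '.join(t))

comment_expr = (pp.Optional(products("product1"))
                + description("description")
                + pp.Optional(products("product2")))


def parse_comment(line):
    """Hypothetical helper: parse one comment line into a plain dict."""
    result = comment_expr.parseString(line)
    parsed = {"description": result["description"]}
    for name in ("product1", "product2"):
        if name in result:
            parsed[name] = result[name].asList()
    return parsed


if __name__ == "__main__":
    comments = [
        "aaa is a good product",
        "I prefer aaa",
        "No comment",
        "aaa bbb are both good products",
    ]
    for comment in comments:
        print(parse_comment(comment))
```

Note that description still requires at least one non-product word, so a comment consisting only of product names would raise a ParseException with this sketch.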