Tags: python, performance, parsing, case-insensitive, pyparsing

Matching against a large number of strings containing spaces in pyparsing


With pyparsing I need to write a matcher for expressions like

a + names + c 

with

a = pp.OneOrMore(pp.Word(pp.alphas))
c = pp.OneOrMore(pp.Word(pp.nums))

where names should match any one of the many entries in the list of strings names_list.

There are three complications:

  1. The entries in names_list can contain spaces.
  2. The matching needs to be case-insensitive.
  3. names_list is rather large (~20000 entries).

I tried

names_kw_list = [pp.Keyword(name, caseless=True) for name in names_list ]
names = pp.Or(names_kw_list)

This does not work for entries that contain spaces, and I'm also worried that it is not a very performant way to write this.

Any ideas on how to get this working for entries with spaces, and perhaps make it faster?


Solution

  • A partial answer:

    The spaces problem can be solved by passing a suitable stopOn expression to OneOrMore:

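    import pyparsing as pp

    def last_occurrence_of(expr):
        # Match expr only if expr does not occur again further along in the input.
        return expr + ~pp.FollowedBy(pp.SkipTo(expr))

    # names_list is the list of name strings from the question.
    names_kw_list = [pp.Keyword(word, caseless=True) for word in names_list]
    names = pp.Or(names_kw_list)("names")
    # Stop collecting words for a once the last occurrence of a name is reached.
    a = pp.OneOrMore(pp.Word(pp.alphas), stopOn=last_occurrence_of(names))("A")
    c = pp.OneOrMore(pp.Word(pp.nums))("C")

    expr = a + names + c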
    

    This instructs a to stop before it consumes the words that belong to names.

    However, performance deteriorates, because the long list of names is now also evaluated as part of the stopOn expression.
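
One possible way to reduce the cost of trying ~20000 separate Keyword objects is to collapse names_list into a single case-insensitive regular expression. The sketch below is not part of the answer above and has not been benchmarked; it uses only the standard re module, and the helper name build_names_pattern and the sample entries are hypothetical. It assumes every entry starts and ends with a word character, so that \b word boundaries behave as expected. The resulting pattern string could presumably be wrapped as pp.Regex(pattern, flags=re.IGNORECASE) and used in place of the pp.Or of keywords.

```python
import re

def build_names_pattern(names_list):
    """Return one regex pattern string that matches any entry of names_list."""
    # Drop empty entries and duplicates; try longer entries first so that
    # "New York City" is preferred over its prefix "New York".
    ordered = sorted({n for n in names_list if n.split()}, key=len, reverse=True)
    # Escape each word and allow any run of whitespace between the words.
    alternatives = [
        r"\s+".join(re.escape(word) for word in name.split())
        for name in ordered
    ]
    # Word boundaries keep "Paris" from matching inside "Parisian".
    return r"\b(?:" + "|".join(alternatives) + r")\b"

# Hypothetical sample entries standing in for the real names_list.
names_list = ["New York", "New York City", "Paris"]
pattern = build_names_pattern(names_list)
names_re = re.compile(pattern, re.IGNORECASE)

match = names_re.search("flights to new   york city 2024")
matched = match.group(0)
```

Sorting longest-first matters because regex alternation takes the first alternative that succeeds, not the longest one. Whether this is fast enough for the full list would need to be measured on the real data.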