python, pyparsing

pyparsing - match one keyword from a big list


I need to check that the input contains exactly one keyword from a list of words. If the list is small, I can list all the keywords explicitly:

from pyparsing import CaselessKeyword

ticker = CaselessKeyword('SPY') | CaselessKeyword('QQQ')
ticker.run_tests(['SPY', 'QQQ'])

But what is the right approach when the list is really large (10k-100k keywords) and we want to be sure that this position holds one of these words and nothing else?


Solution

  • Rather than create thousands of CaselessKeywords in a big MatchFirst, you are probably better off writing an expression that just matches a word of characters, and then validating that word against a set of strings that are known ticker symbols.

    See the code below, which uses a parse action to upper-case the found words, and then a condition to filter for valid ticker symbols.

    I defined ticker_symbols using a space-delimited string to make it easy to add new symbols. Just add new symbols to this string, separated by spaces, and add line breaks as needed.

    But no matter what, when the list of symbols gets up into the thousands (or hundreds of thousands!), this process will be pretty slow.

    import pyparsing as pp
    
    # define all valid ticker symbols - save in a set for fast "in" testing
    ticker_symbols = set("""\
    AAPL IBM GE F TTT QQQ SPY GOOG INTL TI HP AMD
    """.split())
    
    # parse action condition to see if a word matches any ticker symbol
    def is_ticker(t):
        return t[0] in ticker_symbols
    
    # expression to match any word of characters, detecting word breaks
    ticker = pp.Regex(r"\b[A-Za-z]+\b")
    
    # add upper-casing parse action and ticker-validating condition
    # (snake_case names, matching the pyparsing 3 style of run_tests above)
    ticker.add_parse_action(pp.common.upcase_tokens)
    ticker.add_condition(is_ticker)
    
    # try it out
    print(ticker.search_string("AAA SPYY QQQ INTL1 A AAPL ge zaapl"))
    

    Prints:

    [['QQQ'], ['AAPL'], ['GE']]
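
For a list in the 10k-100k range you would normally build the set from data rather than type it in. The following is only a sketch of how the same word-plus-condition idea could be wrapped for that case. The helper name `make_ticker_parser` and the generated `big_list` are made up for illustration (a stand-in for symbols loaded from a file or database); they are not part of pyparsing. It assumes pyparsing 3.

```python
import itertools

import pyparsing as pp


def make_ticker_parser(symbols):
    """Hypothetical helper: build an expression that accepts only the given symbols."""
    # normalize once, so each check is a single set-membership test
    valid = {s.upper() for s in symbols}

    # match any whole word of letters, upper-case it, then validate it
    expr = pp.Regex(r"\b[A-Za-z]+\b")
    expr.add_parse_action(pp.common.upcase_tokens)
    expr.add_condition(lambda t: t[0] in valid)
    return expr


# stand-in for a large symbol list: every 1- to 4-letter string over A-J
big_list = [
    "".join(chars)
    for n in range(1, 5)
    for chars in itertools.product("ABCDEFGHIJ", repeat=n)
]  # 10 + 100 + 1,000 + 10,000 = 11,110 symbols

ticker = make_ticker_parser(big_list)
found = ticker.search_string("abc XYZ jjjj ABCDE 42 hi").as_list()
print(found)
```

With this input it should print `[['ABC'], ['JJJJ'], ['HI']]`: `XYZ` uses letters outside A-J, `ABCDE` is too long to be in the generated list, and `42` is not a word of letters. Only one `Regex` expression is ever created, regardless of how many symbols are in the set.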