I need to check for the presence of one keyword from a list of words. If the list is small, I can explicitly list all the keywords:
ticker = CaselessKeyword('SPY') | CaselessKeyword('QQQ')
ticker.run_tests(['SPY', 'QQQ'])
But what is the right approach if the list is really large (10k-100k keywords) and we want to be sure that exactly one of these words, and nothing else, appears at this position?
Rather than create thousands of CaselessKeywords in a big MatchFirst, you are probably better off writing an expression that just matches a word of characters, and then validates it against a set of strings that are known ticker symbols.
See the code below, which uses a parse action to upcase the found words, and then a condition to filter for valid ticker symbols.
I defined ticker_symbols using a space-delimited string to make it easy to add new symbols. Just add new symbols to this string, separated by spaces, and add line breaks as needed.
But no matter what, when the list of symbols gets up into the thousands (or hundreds of thousands!), this process will be pretty slow.
import pyparsing as pp
# define all valid ticker symbols - save in a set for fast "in" testing
ticker_symbols = set("""\
AAPL IBM GE F TTT QQQ SPY GOOG INTL TI HP AMD
""".split())
# parse action condition to see if a word matches any ticker symbol
def is_ticker(t):
    return t[0] in ticker_symbols
# expression to match any word of characters, detecting word breaks
ticker = pp.Regex(r"\b[A-Za-z]+\b")
# add upper-casing parse action and ticker-validating condition
ticker.addParseAction(pp.common.upcaseTokens)
ticker.addCondition(is_ticker)
# try it out
print(ticker.searchString("AAA SPYY QQQ INTL1 A AAPL ge zaapl"))
Prints
[['QQQ'], ['AAPL'], ['GE']]
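For comparison, here is a minimal sketch of the same match-then-validate idea using only the standard library, in case it helps to see the technique without pyparsing. The names TICKER_SYMBOLS, find_tickers and is_single_ticker are hypothetical and only for illustration; the symbol list is the same sample data as above, and for a really large list you would presumably load it from a file instead. The second function is one possible reading of "one of these words and only it": the whole string must be exactly one known symbol.

```python
import re

# Hypothetical sample data; a real list would likely be loaded from a file.
TICKER_SYMBOLS = frozenset("AAPL IBM GE F TTT QQQ SPY GOOG INTL TI HP AMD".split())

# Same word pattern as the pyparsing Regex above.
WORD = re.compile(r"\b[A-Za-z]+\b")

def find_tickers(text):
    """Return the upper-cased words in text that are known ticker symbols."""
    candidates = (m.group().upper() for m in WORD.finditer(text))
    return [word for word in candidates if word in TICKER_SYMBOLS]

def is_single_ticker(text):
    """True only if the whole string is exactly one known ticker symbol."""
    return WORD.fullmatch(text) is not None and text.upper() in TICKER_SYMBOLS

print(find_tickers("AAA SPYY QQQ INTL1 A AAPL ge zaapl"))
print(is_single_ticker("spy"), is_single_ticker("SPY QQQ"))
```

The set membership test costs about the same whether the set holds ten symbols or a hundred thousand, so the size of the list mostly affects memory and load time rather than the per-word check.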