I need to match a very large list of keywords (>1000000) in a string efficiently using Python. I found some good libraries that try to do this fast:
1) FlashText (https://github.com/vi3k6i5/flashtext)
2) Aho-Corasick Algorithm etc.
However, I have a peculiar requirement: in my context, a keyword such as 'XXXX YYYY' should return a match if my string is ' XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring here, but XXXX and YYYY are both present in the string, and that is good enough for a match.
I know how to do it naively. What I am looking for is efficiency. Are there any good libraries for this?
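For reference, the naive approach can be sketched roughly as follows. This is only a baseline under simple assumptions: words are split on whitespace, matching is case-insensitive, and the function name `naive_match` is made up for illustration. It checks every keyword against every string, which is what becomes too slow with more than a million keywords.

```python
def naive_match(keywords, text):
    """Return the keywords whose words all occur in `text`, in any order."""
    # Tokenize the string once; whitespace splitting is an assumption.
    text_words = set(text.lower().split())
    return [
        kw for kw in keywords
        if all(word in text_words for word in kw.lower().split())
    ]


keywords = ["XXXX YYYY", "XXXX ZZZZ"]
text = " XXXX is a very good indication of YYYY"
print(naive_match(keywords, text))  # ['XXXX YYYY']
```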
What you ask sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory as follows.
from whoosh import fields
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant",
]

# One stored, tokenized text field per document.
schema = fields.Schema(text=fields.TEXT(stored=True))

# Build the index in memory.
storage = RamStorage()
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# By default the query parser requires all terms to be present, in any order.
query = QueryParser('text', schema).parse('dog apple')
with index.searcher() as searcher:
    results = searcher.search(query)
    for r in results:
        print(r)
This produces:
<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
You can also persist your index using FileStorage, as described in "How to index documents" in the Whoosh documentation.
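For the original problem the roles are reversed: the keywords are the "documents" and each incoming string acts as the query. The idea behind a full-text index can be sketched in plain Python with an inverted index from words to keywords. This is a simplified illustration, not how Whoosh is implemented. The names `build_keyword_index` and `match_keywords` are made up, and whitespace tokenization with lowercasing is an assumption.

```python
from collections import Counter, defaultdict


def build_keyword_index(keywords):
    """Map each word to the ids of the keywords that contain it."""
    word_to_ids = defaultdict(list)
    sizes = []  # number of distinct words in each keyword
    for kw_id, keyword in enumerate(keywords):
        words = set(keyword.lower().split())
        sizes.append(len(words))
        for word in words:
            word_to_ids[word].append(kw_id)
    return word_to_ids, sizes


def match_keywords(text, keywords, word_to_ids, sizes):
    """Return the keywords whose words all occur in `text`, in any order."""
    hits = Counter()
    # Only keywords sharing at least one word with the text are touched.
    for word in set(text.lower().split()):
        for kw_id in word_to_ids.get(word, ()):
            hits[kw_id] += 1
    # A keyword matches when every one of its distinct words was seen.
    return [keywords[i] for i, n in sorted(hits.items()) if n == sizes[i]]


keywords = ["dog apple", "poodle apple", "dog"]
word_to_ids, sizes = build_keyword_index(keywords)
text = "Here's an apple with a dog to show that order is irrelevant"
print(match_keywords(text, keywords, word_to_ids, sizes))  # ['dog apple', 'dog']
```

The index is built once. The work per string then depends on how many keywords share a word with it, not on the total number of keywords.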