Tags: python, string, pattern-matching, string-matching, keyword-search

Search keywords efficiently when keywords are multi-word


I need to match a really large list of keywords (>1000000) in a string efficiently using Python. I found some really good libraries that try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) Aho-Corasick Algorithm etc.

However, I have a peculiar requirement: in my context, a keyword such as 'XXXX YYYY' should return a match if my string is ' XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur as a substring here, but XXXX and YYYY are both present in the string, and that is good enough for a match.

I know how to do it naively. What I am looking for is efficiency. Are there any other good libraries for this?
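
For reference, the naive approach mentioned above might look like the following sketch. The function name `keyword_matches` and whitespace-only tokenization are assumptions made for illustration. Checking each keyword this way costs one pass over the keyword list per string, which is what becomes slow with more than a million keywords.

```python
def keyword_matches(keyword, text):
    """Return True if every word of `keyword` occurs somewhere in `text`."""
    # Assumption: words are separated by whitespace and compared case-sensitively.
    words = set(text.split())
    return all(part in words for part in keyword.split())


print(keyword_matches("XXXX YYYY", " XXXX is a very good indication of YYYY"))  # True
print(keyword_matches("XXXX YYYY", "only XXXX appears here"))  # False
```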


Solution

  • What you are asking for sounds like a full-text search task. There is a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory as follows.

    from whoosh.filedb.filestore import RamStorage
    from whoosh.qparser import QueryParser
    from whoosh import fields
    
    
    texts = [
        "Here's a sentence with dog and apple in it",
        "Here's a sentence with dog and poodle in it",
        "Here's a sentence with poodle and apple in it",
        "Here's a dog with and apple and a poodle in it",
        "Here's an apple with a dog to show that order is irrelevant"
    ]
    
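    # A single stored TEXT field, so each hit can display the original sentence.
    schema = fields.Schema(text=fields.TEXT(stored=True))
    # RamStorage keeps the whole index in memory.
    storage = RamStorage()
    index = storage.create_index(schema)
    
    writer = index.writer()
    for t in texts:
        writer.add_document(text=t)
    writer.commit()
    
    # By default the query parser ANDs the terms: both words must occur, in any order.
    query = QueryParser('text', schema).parse('dog apple')
    with index.searcher() as searcher:
        for r in searcher.search(query):
            print(r)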
    

    This produces:

    <Hit {'text': "Here's a sentence with dog and apple in it"}>
    <Hit {'text': "Here's a dog with and apple and a poodle in it"}>
    <Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>
    

    You can also persist your index using FileStorage, as described in the "How to index documents" section of the Whoosh documentation.
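
The example above indexes the texts and queries them with one keyword. For the opposite direction described in the question (many keywords checked against one string), the same inverted-index idea can be sketched in plain Python. This is only an illustration under stated assumptions, not a Whoosh feature: `build_keyword_index` and `find_keywords` are hypothetical helper names, tokenization is a plain lowercase whitespace split, and each keyword is filed under its alphabetically smallest word.

```python
from collections import defaultdict


def build_keyword_index(keywords):
    """Map one word of each keyword to that keyword's full set of words."""
    index = defaultdict(list)
    for keyword in keywords:
        tokens = frozenset(keyword.lower().split())
        if tokens:
            # File the keyword under a single word (the alphabetically
            # smallest one, an arbitrary choice) so it is stored only once.
            index[min(tokens)].append((keyword, tokens))
    return index


def find_keywords(text, index):
    """Return the keywords whose words all occur in `text`, in any order."""
    words = set(text.lower().split())
    matches = []
    for word in words:
        # Only keywords filed under a word of the text are candidates.
        for keyword, tokens in index.get(word, ()):
            if tokens <= words:  # every keyword word is present
                matches.append(keyword)
    return sorted(matches)


keywords = ["XXXX YYYY", "dog apple", "poodle"]
index = build_keyword_index(keywords)
print(find_keywords(" XXXX is a very good indication of YYYY", index))
# ['XXXX YYYY']
```

With this layout, the work per string depends on the number of words in the string and on how many keywords are filed under those words, rather than on the total number of keywords. Filing each keyword under its rarest word instead of the alphabetically smallest one would likely shrink the candidate lists further, but that needs word frequencies the sketch does not compute.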