python, full-text-search, fuzzy-search

Fuzzy text search in Python


I am wondering whether there is a Python library that can perform fuzzy text search. For example:

  • I have three keywords "letter", "stamp", and "mail".
  • I would like a function that checks whether those three words occur within the same paragraph (or within some other distance, such as one page).
  • In addition, the words have to appear in that order. It is fine if other words appear between them.

I have tried fuzzywuzzy, but it did not solve my problem. Another library, Whoosh, looks powerful, but I could not find the right function.


Solution

  • {1} You can do this in Whoosh 2.7. Fuzzy search is enabled by adding the plugin whoosh.qparser.FuzzyTermPlugin to the query parser:

    whoosh.qparser.FuzzyTermPlugin lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
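
    To make the notion of "edits" concrete, here is a minimal, dependency-free sketch of an edit distance that also counts a swap of two adjacent characters as a single edit (the "optimal string alignment" variant of Damerau-Levenshtein). The function name osa_distance is made up for this illustration, and this is not how Whoosh computes matches internally; it only shows what the number after ~ measures.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions, and transpositions
    of two adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for pair in [("mail", "mial"), ("letter", "letters"), ("stamp", "stamps")]:
        print(pair, osa_distance(*pair))
```

    Under this measure "mial" is one edit away from "mail", and "letters" is one edit away from "letter".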

    To add the fuzzy plugin:

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())
    

    Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.

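    For example, each of the following is a valid fuzzy term query (the default distance of 1, a maximum distance of 2, and a maximum distance of 2 where the first 3 characters must match exactly):

    letter~
    letter~2
    letter~2/3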
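
    As a reading aid for the term~maxdist/prefixlength syntax, the sketch below splits such a token into its parts. parse_fuzzy_term is a hypothetical helper written for this answer, not part of Whoosh (the real parsing is done by FuzzyTermPlugin), and treating a missing /n as a prefix length of 0 is an assumption of this sketch.

```python
import re

# Hypothetical pattern for illustration: text, optional ~maxdist, optional /prefixlength.
_FUZZY_TERM = re.compile(
    r"^(?P<text>[^~\s]+)(?:~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?)?$"
)


def parse_fuzzy_term(token):
    """Return (text, maxdist, prefixlength) for a token such as 'letter~2/3'."""
    m = _FUZZY_TERM.match(token)
    if m is None:
        raise ValueError("not a valid term: %r" % token)
    is_fuzzy = "~" in token
    if m.group("maxdist"):
        maxdist = int(m.group("maxdist"))
    else:
        # A bare "~" means distance 1; no "~" at all means an exact term.
        maxdist = 1 if is_fuzzy else 0
    # Assumption of this sketch: no "/n" means no required exact prefix.
    prefix = int(m.group("prefix")) if m.group("prefix") else 0
    return m.group("text"), maxdist, prefix


if __name__ == "__main__":
    for token in ["letter", "letter~", "letter~2", "letter~2/3"]:
        print(token, "->", parse_fuzzy_term(token))
```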
    

    {2} To keep the words in order, use a phrase query (whoosh.query.Phrase). To use fuzzy terms inside the phrase, replace the default phrase plugin with whoosh.qparser.SequencePlugin:

    "letter~ stamp~ mail~"
    

    To replace the default phrase plugin with the sequence plugin:

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.remove_plugin_class(qparser.PhrasePlugin)
    parser.add_plugin(qparser.SequencePlugin())
    

    {3} To allow other words in between, set the slop argument of the Phrase query to a larger number:

    whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
    

    slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
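
    The effect of slop can be illustrated without Whoosh. The sketch below checks whether a list of words occurs in order in a tokenized text, with consecutive matches at most slop positions apart, so slop=1 requires adjacent words, in line with the definition quoted above. The names tokenize and in_order_within_slop, and the matches parameter, are made up for this illustration; Whoosh's own analyzer and matcher may behave differently (for example around stop words).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def in_order_within_slop(tokens, words, slop=1, matches=lambda tok, word: tok == word):
    """True if `words` occur in `tokens` in order, with each consecutive
    pair of matched positions at most `slop` apart.

    `matches` decides whether a token counts as a hit for a word;
    the default is exact equality.
    """
    def search(word_idx, prev_pos):
        if word_idx == len(words):
            return True  # every word has been placed
        if prev_pos is None:
            candidates = range(len(tokens))  # first word may start anywhere
        else:
            candidates = range(prev_pos + 1, min(len(tokens), prev_pos + slop + 1))
        # Try every candidate position and backtrack if a later word fails.
        return any(
            matches(tokens[pos], words[word_idx]) and search(word_idx + 1, pos)
            for pos in candidates
        )

    return search(0, None)


if __name__ == "__main__":
    tokens = tokenize("letter first, stamp second, mail third")
    words = ["letter", "stamp", "mail"]
    print(in_order_within_slop(tokens, words, slop=1))
    print(in_order_within_slop(tokens, words, slop=2))
```

    With slop=1 the sample sentence does not match, because "first" sits between "letter" and "stamp"; with slop=2 it does. Passing a different matches predicate, for instance one based on an edit distance, would turn this into a fuzzy check.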

    You can also set the slop directly in the query string:

    "letter~ stamp~ mail~"~10
    

    {4} Overall solution:

    {4.a} The indexer would look like this:

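    import os

    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)  # create_in() expects the directory to exist
    ix = create_in("indexdir", schema)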
    writer = ix.writer()
    writer.add_document(title=u"First document", content=u"This is the first document we've added!")
    writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
    writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
    writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
    writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
    writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
    writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
    writer.commit()
    

    {4.b} The searcher would look like this:

    from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin
    
    with ix.searcher() as searcher:
        parser = QueryParser(u"content", ix.schema)
        parser.add_plugin(FuzzyTermPlugin())
        parser.remove_plugin_class(PhrasePlugin)
        parser.add_plugin(SequencePlugin())
        query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
        results = searcher.search(query)
        print("nb of results =", len(results))
        for r in results:
            print(r)


    That gives the result:

    nb of results = 2
    <Hit {'title': 'Sixth document'}>
    <Hit {'title': 'Third document'}>
    

    {5} If you want fuzzy matching to be the default, without adding ~n to every word in the query, initialize QueryParser like this:

    from whoosh.query import FuzzyTerm
    parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)
    

    Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default maximum edit distance of maxdist=1. Subclass it if you want a larger edit distance, and pass the subclass as termclass:

    class MyFuzzyTerm(FuzzyTerm):
        def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
            # Same signature as FuzzyTerm, but maxdist defaults to 2.
            # In Python 3, super().__init__(...) is equivalent.
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

    parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)
    

    References:

    1. whoosh.query.Phrase
    2. Adding fuzzy term queries
    3. Allowing complex phrase queries
    4. class whoosh.query.FuzzyTerm
    5. qparser module