
Python: Whoosh search for a non-exact query


Is it possible to use Whoosh to search for documents that do not exactly match the query but are very close to it? For example, a document should still be found when the query differs from it by only one word.

I wrote some simple code that works when every word of the query appears in all of the documents:

import os.path
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser


if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True))
ix = create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV HD')
    results = searcher.search(myquery)
    
    for result in results:
        print(result)

Unfortunately, if I change the query to any of the ones below, I no longer get all 3 documents (or get none at all):

myquery = parser.parse(u'TV Ultra HD')  # 2 hits
myquery = parser.parse(u'TV 4K Ultra HD')  # 1 hit
myquery = parser.parse(u'TV HD 2022')  # 0 hits

Is it possible to set up the parser so that any of these queries still returns all 3 documents, even if the title field is slightly different?


Solution

  • After some thought, I settled on plain enumeration of all combinations of the query words.

    I added a variable, tolerance: the maximum number of words that can be dropped from the original query. I also added a separate function, getResults(words, tolerance).
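
    To make the enumeration concrete, here is a minimal standalone sketch of the query strings it produces. The helper name query_variants and its max_dropped parameter are hypothetical (they are not part of Whoosh or of the code in this answer), and the sketch assumes variants are tried from the fewest dropped words to the most.

```python
from itertools import combinations


def query_variants(words, max_dropped):
    """Yield query strings with 1..max_dropped words removed, longest first.

    Hypothetical helper for illustration only; it does not touch Whoosh.
    """
    count = len(words)
    for dropped in range(1, max_dropped + 1):
        size = count - dropped
        if size <= 0:
            # Nothing left to search for once every word has been dropped.
            break
        for variant in combinations(words, size):
            yield ' '.join(variant)


variants = list(query_variants(['TV', 'HD', '2022'], 1))
print(variants)  # ['TV HD', 'TV 2022', 'HD 2022']
```

    Each of these strings would then be parsed and searched in turn, stopping at the first one that returns hits. The number of variants grows combinatorially with the query length and the tolerance, so this only seems practical for short queries.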

    The final code is:

    import os.path
    from whoosh.fields import Schema, TEXT
    from whoosh.index import create_in, open_dir
    from whoosh.qparser import QueryParser
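    from whoosh.searching import Results
    from itertools import combinations
    from typing import Optional
    
    
    def getResults(words: list, tol: int) -> Optional[Results]:
        # Relies on the module-level `parser` and `searcher` created further down.
        count = len(words)
        
        # Drop 1, 2, ... up to `tol` words, trying the longest variants first.
        for dropped in range(1, tol + 1):
            size = count - dropped
            if size <= 0:
                return None
            
            for variant in combinations(words, size):
                myquery = parser.parse(' '.join(variant))
                results = searcher.search(myquery)
                
                # An empty Results object is falsy, so the first variant with hits wins.
                if results:
                    return results
        
        return None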
    
    
    if not os.path.exists("index"):
        os.mkdir("index")
    
    schema = Schema(title=TEXT(stored=True, spelling=True))
    ix = create_in("index", schema)
    ix = open_dir("index")
    
    writer = ix.writer()
    writer.add_document(title=u'TV Ultra HD')
    writer.add_document(title=u'TV 4K Ultra HD')
    writer.add_document(title=u'TV HD 2022')
    writer.commit()
    
    with ix.searcher() as searcher:
        parser = QueryParser('title', ix.schema)
        words = u'TV HD 2022'.split(' ')
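        tolerance = 1  # maximum number of words that may be dropped
        # getResults() returns None when nothing matches, so fall back to an empty list
        results = getResults(words, tolerance) or []
        
        for result in results:
            print(result)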
    

    The result is 3 Hits:

    <Hit {'title': 'TV Ultra HD'}>
    <Hit {'title': 'TV HD 2022'}>
    <Hit {'title': 'TV 4K Ultra HD'}>
    

    But I consider this a poor solution, because it seems to me that Whoosh should allow a much more concise implementation.