full-text-search information-retrieval whoosh

Whoosh proximity search


I would like to know how to use proximity search with Whoosh. I have read the Whoosh documentation, which says that the class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None) can be used for proximity search.

For example, I need to find "Hello World" in the index, but "Hello" should be within a 5-word distance of "World".

So far I am using the following code, and it works fine with the normal parser.

from whoosh.index import open_dir
from whoosh.analysis import StandardAnalyzer
from whoosh.qparser import QueryParser
from whoosh.query import *

index_path = "/home/abhi/Desktop/CLIR/indexdir_test"

ix = open_dir(index_path)

query='Hello World'

ana = StandardAnalyzer(stoplist=stop_word)


qp = QueryParser("content", schema=ix.schema,termclass=Phrase)
q=qp.parse(query)
with ix.searcher() as s:
   results = s.search(qp,limit=5)
   for result in results:
       print(result['content']+result['title'])
       print (result.score)
   print(len(results)) 

Please help me use the class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None) for proximity search and vary the distance between the words. Thanks in advance.


Solution

  • What you want is a slop factor of 5.

    A few points:

    1. When you search, you must pass the query (q), not the query parser (qp): results = s.search(q, limit=5)

    2. limit refers to the maximum number of documents to return, not the slop factor. Your limit=5 parameter is saying you want to get up to 5 search results back (in case you were thinking this is the slop).

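    3. You can remove termclass=Phrase. That parameter sets the query class used for individual terms; the parser already builds a Phrase query from quoted text.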

    You can construct a phrase query two ways:

    1. Using a query string. Good for passing along a user query. Append ~ and the slop factor to the phrase for proximity search. If you want phrase terms to be up to 5 words apart: "hello world"~5
    2. Using a SpanNear2 query. This lets you build the query programmatically: pass the phrase terms as a list of Term objects and specify slop as a constructor parameter.
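    from whoosh.qparser import QueryParser
    from whoosh.query import Term, spans

    # ix is the index opened in the question: ix = open_dir(index_path)
    with ix.searcher() as s:
        # Option 1: query string; the "~5" suffix sets the slop
        qp = QueryParser("content", schema=ix.schema)
        q = qp.parse('"Hello World"~5')
        results = s.search(q, limit=5)

        # Option 2: SpanNear2; Term text is not analyzed, so use the
        # indexed form (lowercase if the field uses StandardAnalyzer)
        q = spans.SpanNear2([Term("content", "hello"), Term("content", "world")], slop=5)
        results = s.search(q, limit=5)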