python information-retrieval whoosh

NEAR type query in whoosh


The slides from PyCon 2013 mention NEAR-type queries. I've looked through the documentation and there is no mention of a NEAR keyword in the query language. The closest thing I could find is this:

"whoosh library"~5

which matches a document if 'library' occurs within 5 words after 'whoosh'.

I was wondering whether there is a way to make this kind of query:

'whoosh' NEAR:X 'python' NEAR:X 'retrieval'

where X represents the maximum number of words between the query words (i.e., 'whoosh', 'python', 'retrieval').


Solution

  • I went through the documentation again and found the SpanNear2 class, which seems to do the job. An example for three terms:

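       from whoosh import query
       from whoosh.query import spans

       # Term text must match the indexed tokens; lowercase assumes a lowercasing analyzer
       t1 = query.Term("sentence", "whoosh")
       t2 = query.Term("sentence", "python")
       t3 = query.Term("sentence", "retrieval")
       q = spans.SpanNear2([t1, t2, t3], slop=5, ordered=True)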
    

    This would match a document containing a sentence like:

      "The Whoosh project is a python library for information retrieval."
    

    but not this sentence:

      "Whoosh is a great open source project is a python for information retrieval."
    

    since there are 8 tokens between 'Whoosh' and 'python', which exceeds slop=5.
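
The slop arithmetic for the two sentences can be sketched in plain Python. This is not Whoosh's implementation: `tokenize` and `near_in_order` are hypothetical helpers, tokens are lowercased words with no stop-word removal, and slop is assumed to bound the difference between the positions of consecutive terms (so 8 intervening tokens give a distance of 9).

```python
import re


def tokenize(text):
    # Lowercased word tokens; no stop-word removal (a simplifying assumption).
    return re.findall(r"\w+", text.lower())


def near_in_order(text, terms, slop):
    """True if the terms occur in order, each at most `slop` positions after the previous one."""
    if not terms:
        return False
    tokens = tokenize(text)
    # For every term, collect all positions where it occurs.
    positions = [
        [i for i, tok in enumerate(tokens) if tok == term.lower()]
        for term in terms
    ]

    def chain(k, prev):
        # All terms placed: the chain is complete.
        if k == len(terms):
            return True
        # Try every occurrence of term k that follows `prev` within the slop.
        return any(
            0 < pos - prev <= slop and chain(k + 1, pos)
            for pos in positions[k]
        )

    return any(chain(1, start) for start in positions[0])


good = "The Whoosh project is a python library for information retrieval."
bad = "Whoosh is a great open source project is a python for information retrieval."
terms = ["whoosh", "python", "retrieval"]

print(near_in_order(good, terms, slop=5))  # True: distances are 4 and 4
print(near_in_order(bad, terms, slop=5))   # False: 'whoosh' to 'python' is 9
```

In a real index the analyzer decides which tokens get positions. If it drops stop words such as "is" and "a", the distances may come out smaller than the counts above, so it is worth checking against the analyzer actually used for the field.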