How to get tf-idf score and bm25f score of a term in a document using whoosh?


I am using whoosh to index a dataset. How can I retrieve the tf-idf score and the BM25F score for a given term and document? I have seen scoring.TF_IDF() and scoring.TF_IDFScorer(). To call the TF_IDFScorer().score() method we have to pass a matcher object. Which matcher object should I pass to it?

Similarly, what should I pass to BM25FScorer()._score(self, weight, length)? What are the weight and length parameters, and what values are passed by default?


Solution

  • I was finally able to figure it out. Here it is for anyone who comes here later.

    To find the TF-IDF and BM25F scores of a term in a document:

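    from whoosh import scoring
    from whoosh.qparser import QueryParser

    qp = QueryParser('content', ix.schema)
    q = qp.parse(unicode('id:1'))  # matches only the document whose id is 1 (use str() on Python 3)
    # Build a scorer for the term 'algebra' in field 'body' and score the matched document.
    with ix.searcher(weighting=scoring.TF_IDF()) as searcher_tfidf:
        tfidf_score = scoring.TF_IDF().scorer(searcher_tfidf, 'body', 'algebra').score(q.matcher(searcher_tfidf))
    with ix.searcher(weighting=scoring.BM25F()) as searcher_bm25f:
        bm25f_score = scoring.BM25F().scorer(searcher_bm25f, 'body', 'algebra').score(q.matcher(searcher_bm25f))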
    

    ix is the Index object returned by open_dir() or create_in(). The key is to get a Matcher object that matches exactly the required document, so build the query on an id or any other unique field in the schema and parse it with qp.parse().
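
On the weight and length parameters from the question: as far as I can tell, weight is the term's weight (roughly its frequency) in the field of the document being scored, and length is the length of that field in that document. The sketch below is not whoosh's code. It is a plain-Python illustration of the textbook TF-IDF and Okapi BM25 formulas, with function names of my own and with B=0.75 and K1=1.2, which I believe are the BM25F defaults in whoosh. The idf and average field length are passed in by hand here; in whoosh they would come from the searcher.

```python
def tf_idf(weight, idf):
    """TF-IDF: the term's weight in the document times its inverse document frequency."""
    return weight * idf


def bm25(weight, length, idf, avg_length, B=0.75, K1=1.2):
    """Textbook Okapi BM25 for one term in one field of one document.

    weight     -- how often the term occurs in this document's field
    length     -- number of terms in this document's field
    idf        -- inverse document frequency of the term (assumed precomputed)
    avg_length -- average length of the field over the whole index
    B, K1      -- free parameters (assumed defaults, see the note above)
    """
    # Length normalisation: fields longer than average are penalised.
    norm = (1 - B) + B * float(length) / avg_length
    return idf * (weight * (K1 + 1)) / (weight + K1 * norm)


# Made-up numbers: the term occurs 3 times, the idf is 2.0,
# and the average field length is 100 terms.
print(tf_idf(3, 2.0))
print(bm25(3, 100, 2.0, 100))   # field of average length
print(bm25(3, 200, 2.0, 100))   # same term count in a field twice as long
```

With the same term count, the longer field gets the lower BM25 score, which is the only thing the length argument is there for.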