Tags: java, sorting, lucene, ranking, tf-idf

Customize score for certain conditions in Lucene TF-IDF


I have a program that takes an input query and ranks similar documents based on their TF-IDF scores. The thing is, I want to add some keywords and treat them as part of the "input" as well. These keywords will be different for each query.

For example, if the query is "Logic Based Knowledge Representation", the keywords are as follows:

Level 0 keywords: [logic, base, knowledg, represent]

Level 1 keywords: [tempor, modal, logic, resolut, method, decis, problem,
                   reason, revis, hybrid, represent]

Level 2 keywords: [classif, queri, process, techniqu, candid, semant, data, 
                   model, knowledg, base, commun, softwar, engin, subsumpt,
                   kl, undecid, classic, structur, object, field]

I want to treat the scoring differently per level. For example, for a term in a document that matches a word in Level 0, I want to multiply the score by 1. For a term that matches a word in Level 1, multiply the score by 0.8. And finally, for a term that matches a word in Level 2, multiply the score by 0.64.

My purpose is to expand the input query while also making sure that documents containing more keywords from Level 0 are treated as more important, and documents containing keywords from Levels 1 and 2 as less important (even though the input is expanded). I have not implemented this yet. My program so far only computes the TF-IDF score of every document against the query and ranks the results:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Ranking {

    private static int maxHits = 2000000;

    public static void main(String[] args) throws Exception {
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        String paperTitle = br.readLine();

        // CitedKeywords ckeywords = new CitedKeywords();
        // ckeywords.readDataBase(paperTitle);

        String querystr = args.length > 0 ? args[0] : paperTitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "title", analyzer)
            .parse(querystr);

        IndexReader reader = DirectoryReader.open(
                             FSDirectory.open(
                             new File("E:/Lucene/new_bigdataset_index")));

        IndexSearcher searcher = new IndexSearcher(reader);

        // Score with the custom VSM similarity instead of Lucene's default
        VSMSimilarity vsmSimilarity = new VSMSimilarity();
        searcher.setSimilarity(vsmSimilarity);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;

        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");

        int counter = 0;
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            ++counter;
        }
        writer.close();
    }
}

--

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class VSMSimilarity extends DefaultSimilarity{

    // Weighting codes
    public boolean doBasic     = true;  // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean   = false; // Boolean

    //Scoring codes
    public boolean doCosine    = true;
    public boolean doOverlap   = false;

    private static final long serialVersionUID = 4697609598242172599L;

    // term frequency in document = 
    // measure of how often a term appears in the document
    public float tf(int freq) {     
        // Sublinear tf weighting. Equation taken from [1], pg 127, eq 6.13.
        if (doSublinear){
            if (freq > 0){
                return 1 + (float)Math.log(freq);
            } else {
                return 0;
            }
        } else if (doBoolean){
            return 1;
        }
        // else: doBasic
        // The default behaviour of Lucene is sqrt(freq), 
        // but we are implementing the basic VSM model
        return freq;
    }

    // inverse document frequency = 
    // measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {
        if (doBoolean || doOverlap){
            return 1;
        }
        // The default behaviour of Lucene is 
        // 1 + log (numDocs/(docFreq+1)), 
        // which is what we want (default VSM model)
        return super.idf(docFreq, numDocs); 
    }

    // normalization factor so that queries can be compared 
    public float queryNorm(float sumOfSquaredWeights){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.queryNorm(sumOfSquaredWeights);
        }
        // else: can't get here
        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return 1;
        }
        // else: can't get here
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of
    // (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.computeNorm(state);
        }
        // else: can't get here
        return super.computeNorm(state);
    }
}

Below is the sample output of my current program (without score boosting):

3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663

Can anybody please let me know how to adjust the score for the conditions I mentioned above? Does Lucene provide this kind of functionality? Can I integrate it into the VSMSimilarity class?

EDIT: I found this in the Lucene documentation:

 public void setBoost(float b)

Sets the boost for this query clause to b. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b.

Unfortunately, this seems to multiply the score at the document level. I want to do the multiplication at the term level, and I haven't found a way to do this yet. So if a document contains words from Level 0 and Level 1, only the terms from Level 1 should be multiplied by 0.8, for example.


Solution

  • You can use Lucene term boosts.

    https://lucene.apache.org/core/5_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term

    Augment your query like this (assuming OR is the default operator):

    logic base knowledge representation temporal^0.8 modal^0.8 classification^0.64...
    

    And use one of the standard similarity providers (a sketch of building this boosted query in code follows below).

    PS: I noticed LUCENE_42 in your example. This feature exists in almost every version of Lucene (I remember it being there in 2.4.9).
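
    For reference, here is a minimal sketch of how such a boosted query string could be assembled from the three keyword lists and parsed with the same QueryParser setup the question already uses. The level arrays and the 1.0/0.8/0.64 weights come from the question; the class and helper names (BoostedQueryExample, buildBoostedQuery, appendLevel) are made up for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class BoostedQueryExample {

        // Append every keyword of one level to the query string,
        // attaching Lucene's term-boost syntax (term^boost) where needed.
        private static void appendLevel(StringBuilder sb, String[] keywords, float boost) {
            for (String kw : keywords) {
                sb.append(' ').append(kw);
                if (boost != 1.0f) {
                    sb.append('^').append(boost);
                }
            }
        }

        public static Query buildBoostedQuery(String[] level0, String[] level1,
                                              String[] level2) throws Exception {
            StringBuilder sb = new StringBuilder();
            appendLevel(sb, level0, 1.0f);   // Level 0: full weight
            appendLevel(sb, level1, 0.8f);   // Level 1: multiplied by 0.8
            appendLevel(sb, level2, 0.64f);  // Level 2: multiplied by 0.64

            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
            // With OR as the default operator the query matches any of the
            // terms, but lower-level terms contribute less to the score.
            return new QueryParser(Version.LUCENE_42, "title", analyzer)
                .parse(sb.toString().trim());
        }
    }

    The returned Query can be passed straight to searcher.search(...) in the Ranking class. Note that the keyword lists in the question look stemmed, so they must match the form of the terms stored in the index (StandardAnalyzer does not stem); alternatively, each keyword could be wrapped in its own TermQuery and weighted per clause with setBoost for the same effect.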