java lucene search-engine tf-idf cosine-similarity

Does Lucene (Java framework) calculate tf-idf and cosine similarity of a document against the term by default?


I am developing a search-engine-based application using the Lucene Java framework, and I am confused by the scoring functionality that Lucene provides by default: does the default scoring implement tf-idf and cosine similarity, or do we have to do something else?

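import java.io.IOException;

import org.apache.lucene.document.Document;
// Lucene 4+ package name; Lucene 3.x uses org.apache.lucene.queryParser.ParseException
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

String indexDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Index";
String dataDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Data";
Indexer indexer;
Searcher searcher;

public static void main(String[] args) {
  LuceneTester tester;
  try {
     tester = new LuceneTester();
     tester.createIndex();
     tester.search("DataGuides");
  } catch (IOException e) {
     e.printStackTrace();
  } catch (ParseException e) {
     e.printStackTrace();
  }
}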

private void createIndex() throws IOException{

  indexer = new Indexer(indexDir);
  int numIndexed;
  long startTime = System.currentTimeMillis();  
  numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
  long endTime = System.currentTimeMillis();
  indexer.close();
  System.out.println(numIndexed+" File indexed, time taken: "
     +(endTime-startTime)+" ms");       
}

I am getting the document score at the end of the search function below:

private void search(String searchQuery) throws IOException, ParseException{
  searcher = new Searcher(indexDir);
  long startTime = System.currentTimeMillis();
  TopDocs hits = searcher.search(searchQuery);
  long endTime = System.currentTimeMillis();

  System.out.println(hits.totalHits +
     " documents found. Time :" + (endTime - startTime));
  for(ScoreDoc scoreDoc : hits.scoreDocs) {
     Document doc = searcher.getDocument(scoreDoc);
     System.out.println(scoreDoc.score+" File: "
        + doc.get(LuceneConstants.FILE_PATH));
  }
  searcher.close();
}
}

I have googled it and found this: how can I implement the tf-idf and cosine similarity in Lucene? Any help will be highly appreciated :)


Solution

  • As of Lucene 6.0, the default similarity implementation is BM25Similarity, which implements BM25.

    If you want to use the old standard similarity implementation, which is the TF-IDF based one, use ClassicSimilarity.

    For a comparison of the two, you might check out:
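
To make the difference between the two similarities mentioned above more concrete, here is a minimal, Lucene-free sketch of the per-term formulas they are built on. The class name `SimilaritySketch` and its method names are hypothetical and not part of Lucene. The formulas are the textbook forms (square-root tf and `1 + ln((N+1)/(n+1))` idf for the classic model, the standard BM25 with `k1 = 1.2` and `b = 0.75`). Lucene's real implementations differ in details such as norm encoding, boosts, and constant factors, so treat the numbers as illustrative only.

```java
// Simplified, Lucene-free sketch of the per-term formulas; all names are hypothetical.
final class SimilaritySketch {
    // BM25 free parameters; 1.2 and 0.75 are the commonly used defaults.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Classic TF-IDF term-frequency factor: square root of the raw count. */
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    /** Classic TF-IDF inverse document frequency: 1 + ln((N + 1) / (n + 1)). */
    static double classicIdf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    /** BM25 inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double bm25Idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 term-frequency factor; it saturates toward K1 + 1 as freq grows. */
    static double bm25Tf(double freq, double docLen, double avgDocLen) {
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        return freq * (K1 + 1.0) / (freq + K1 * lengthNorm);
    }

    /** Plain cosine similarity between two equal-length term-weight vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The main practical difference is visible in the tf factors: the classic factor keeps growing with the term count, while the BM25 factor levels off. In Lucene itself you do not compute any of this by hand. The similarity is chosen by passing an instance (for example `new ClassicSimilarity()`) to `IndexWriterConfig.setSimilarity(...)` at index time and to `IndexSearcher.setSimilarity(...)` at search time, and the same similarity should be used in both places. In the question's code that would have to happen inside the `Indexer` and `Searcher` wrapper classes, whose internals are not shown here.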