I am developing a search-engine application with the Lucene Java framework, and I am confused by the scoring that Lucene provides by default. Does the default scoring implement tf-idf and cosine similarity, or do I have to do something else?
// Imports assume Lucene 4 or later; Indexer, Searcher, TextFileFilter and
// LuceneConstants are my own helper classes.
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {
    String indexDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Index";
    String dataDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Data";
    Indexer indexer;
    Searcher searcher;

    public static void main(String[] args) {
        LuceneTester tester;
        try {
            tester = new LuceneTester();
            tester.createIndex();
            tester.search("DataGuides");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    // Index every text file in dataDir and report how long it took.
    private void createIndex() throws IOException {
        indexer = new Indexer(indexDir);
        int numIndexed;
        long startTime = System.currentTimeMillis();
        numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
        long endTime = System.currentTimeMillis();
        indexer.close();
        System.out.println(numIndexed + " File indexed, time taken: "
            + (endTime - startTime) + " ms");
    }
I am getting the document score at the end of the search function below:
    private void search(String searchQuery) throws IOException, ParseException {
        searcher = new Searcher(indexDir);
        long startTime = System.currentTimeMillis();
        TopDocs hits = searcher.search(searchQuery);
        long endTime = System.currentTimeMillis();
        System.out.println(hits.totalHits +
            " documents found. Time :" + (endTime - startTime));
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = searcher.getDocument(scoreDoc);
            System.out.println(scoreDoc.score + " File: "
                + doc.get(LuceneConstants.FILE_PATH));
        }
        searcher.close();
    }
}
I have googled it and found this: how can I implement tf-idf and cosine similarity in Lucene? Any help would be highly appreciated :)
As of Lucene 6.0, the default similarity implementation is BM25Similarity, which implements BM25.
If you want to use the old standard similarity implementation, use ClassicSimilarity.
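In Lucene itself, switching back should only require passing a ClassicSimilarity instance to both IndexWriterConfig.setSimilarity (at index time) and IndexSearcher.setSimilarity (at search time), presumably inside your Indexer and Searcher classes. To make clear what "tf-idf with cosine similarity" means, here is a minimal stand-alone sketch in plain Java, without Lucene. It assumes the weighting I believe ClassicSimilarity uses (tf = sqrt(freq), idf = 1 + ln((N + 1) / (df + 1))); the class name, method names and toy documents are made up for illustration, and Lucene's real scores will differ because it also applies length normalization and boosts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
public class TfIdfCosineSketch {

    // Count how many documents contain each term (document frequency).
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Build a tf-idf vector for one tokenized text.
    // Assumed weighting: sqrt(freq) * (1 + ln((N + 1) / (df + 1))).
    static Map<String, Double> weights(List<String> tokens,
                                       Map<String, Integer> df, int numDocs) {
        Map<String, Integer> freq = new HashMap<>();
        for (String term : tokens) {
            freq.merge(term, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            double idf = 1.0 + Math.log((numDocs + 1.0) / (d + 1.0));
            w.put(e.getKey(), Math.sqrt(e.getValue()) * idf);
        }
        return w;
    }

    // Cosine of the angle between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy corpus, already tokenized and lowercased.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "search", "engine"),
            Arrays.asList("lucene", "index"),
            Arrays.asList("cooking", "recipe"));
        Map<String, Integer> df = docFreqs(docs);
        Map<String, Double> query =
            weights(Arrays.asList("lucene", "search"), df, docs.size());

        // Score every document against the query.
        for (int i = 0; i < docs.size(); i++) {
            double score = cosine(query, weights(docs.get(i), df, docs.size()));
            System.out.println("doc " + i + " score = " + score);
        }
    }
}
```

With this toy corpus, the document that shares both query terms scores highest, the one that shares only "lucene" scores lower, and the unrelated one scores 0, which is the ranking behavior tf-idf with cosine similarity is meant to give.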
For a comparison of the two, you might check out: