Tags: java, lucene, morelikethis

How would you dynamically filter Lucene's MoreLikeThis?


OK, let me explain what I've done so far, and then hopefully my question will be clearer. I'm analyzing documents and trying to score them based on words that come up frequently in some docs despite being uncommon across the whole index. So far I've gotten some pretty interesting results, and I can see the tf and idf for each term in a given doc.

In order to score the doc as a whole, I want to do something tf-idf related, but I don't want to use every term in the doc. Right now, I've hardcoded some filters to get rid of overly common words (words whose idf is too low to matter to me), and overly uncommon words (words with really high idf scores; in my experience they are usually typos).

Is there a good way to filter out outliers in idf dynamically?
Instead of:

if (idf > x && idf < y)
   include the word

I want to do something like:

if (idf is in the 60th percentile of idfs for the index)
   include it      

Maybe that is the best way to do it, but I'd like to hear of any other solutions you may come up with, thanks!
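
For illustration, the percentile idea above could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: `IdfPercentileFilter` and its methods are hypothetical names (not part of Lucene), the document frequencies are toy values, the idf formula is the classic `1 + ln(numDocs / (docFreq + 1))` that Lucene's TF-IDF similarity is assumed to use (worth checking against your version), and the percentile uses the simple nearest-rank definition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): keeps terms whose idf lies in a percentile band. */
final class IdfPercentileFilter {

    /** Assumed classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Nearest-rank percentile (p in (0, 100]) of a non-empty, ascending-sorted array. */
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Returns the terms whose idf falls between the lowPct and highPct percentiles (inclusive). */
    static List<String> filter(Map<String, Long> docFreqs, long numDocs,
                               double lowPct, double highPct) {
        List<String> kept = new ArrayList<>();
        if (docFreqs.isEmpty()) {
            return kept;
        }
        // Collect and sort the idf of every term so the cutoffs adapt to the index.
        double[] idfs = docFreqs.values().stream()
                .mapToDouble(df -> idf(df, numDocs))
                .sorted()
                .toArray();
        double lo = percentile(idfs, lowPct);
        double hi = percentile(idfs, highPct);
        for (Map.Entry<String, Long> e : docFreqs.entrySet()) {
            double v = idf(e.getValue(), numDocs);
            if (v >= lo && v <= hi) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy document frequencies for an index of 1000 docs (made-up values).
        Map<String, Long> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 990L);   // very common, low idf
        docFreqs.put("lucene", 50L);
        docFreqs.put("index", 120L);
        docFreqs.put("score", 200L);
        docFreqs.put("teh", 1L);     // likely typo, very high idf
        // Keep only terms between the 40th and 80th idf percentiles.
        System.out.println(filter(docFreqs, 1000L, 40.0, 80.0));
    }
}
```

With real data, the document frequencies would come from the index rather than a hand-built map, and the two percentile bounds would replace the hardcoded x and y.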


Solution

  • One of the last steps in the scoring process is handled by a Similarity object, so I believe you only need to write your own custom Similarity. DefaultSimilarity is the default class used by Lucene, and it extends TFIDFSimilarity. I suggest reading the code of both classes to understand how to write your own.

    Once the class is written, assuming it's called KmancSimilarity, here is how to use it:

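    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    Directory dir = <your dir>;  // placeholder: open your own index directory here
    IndexReader index = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(index);
    // Use the custom Similarity for scoring instead of the default one.
    searcher.setSimilarity(new KmancSimilarity());

    // ... continue with your code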
    

    I've been working with Lucene 4.8, so I don't know whether this applies to other versions.

    I hope this helps.