solr lucene full-text-search morelikethis

Sunspot / Solr / Lucene: Find similar articles


Let's say we have a list of articles that are indexed by sunspot/solr/lucene (or any other search engine).

How can the index be used to find articles similar to a given article?

Should this be done with a summarization tool, such as http://www.wordsfinder.com/api_Keyword_Extractor.php, termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html?


Solution

  • What you are trying to do is very similar to the task I outlined in this answer.

    In brief, you need to generate a summary of each document that you can use as a query against all the others. A document summary can be as simple as the top N terms in that document (excluding stop words). You can generate the top N terms from a Lucene document fairly easily without any third-party tools; there are plenty of examples on SO and the web.
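
The idea above can be sketched roughly as follows. This is only an illustration in plain Python, not the Lucene API: it ranks terms by raw frequency, the stop-word list is a tiny made-up sample, and the names `top_terms`, `similarity_query` and the field name `body` are hypothetical. The resulting string uses the usual `field:term OR field:term` query-parser syntax and could be sent to the index as an ordinary search.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real setup would
# reuse the stop words configured in the index's analyzer).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
})


def top_terms(text, n=10):
    """Return the n most frequent non-stop-word terms in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def similarity_query(text, field="body", n=10):
    """Build an OR query over the top terms of text (hypothetical field name)."""
    return " OR ".join(f"{field}:{term}" for term in top_terms(text, n))


article = ("Solr is a search server. Solr builds on Lucene, "
           "and Lucene is a search library.")
print(similarity_query(article, n=3))
```

Raw frequency is the simplest possible ranking; weighting terms by how rare they are across the whole index would likely give better summaries, at the cost of needing index statistics.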