
Document similarity framework


I would like to create an application that searches its database for similar documents; e.g. the user uploads a document (text, image, etc.), and the application returns the stored documents that are most similar to it.

I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.). I'm looking for a framework that ties all of these together.

For example, if I were to implement it in Lucene, I would do the following:

  • Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
  • Then add the created elements to the Lucene index
  • And finally use the MoreLikeThis class to find similar documents
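The three steps above can be illustrated without any particular framework. The following is only a minimal sketch, not Lucene's API: `extract_features`, `SimilarityIndex`, and `more_like_this` are hypothetical names, word shingles stand in for whatever fingerprinting or feature extraction is actually used, and Jaccard overlap stands in for the real scoring function.

```python
from collections import defaultdict


def extract_features(text, k=3):
    """Stand-in feature extractor (assumption): the set of word k-grams."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class SimilarityIndex:
    """Toy inverted index keyed by feature instead of by term."""

    def __init__(self):
        self.postings = defaultdict(set)  # feature -> ids of documents containing it
        self.features = {}                # document id -> its feature set

    def add(self, doc_id, text):
        """Steps 1 and 2: extract features and add them to the index."""
        feats = extract_features(text)
        self.features[doc_id] = feats
        for feature in feats:
            self.postings[feature].add(doc_id)

    def more_like_this(self, text, top_n=5):
        """Step 3: rank indexed documents by Jaccard overlap with the query."""
        query = extract_features(text)
        # Only documents sharing at least one feature can score above zero.
        candidates = set()
        for feature in query:
            candidates |= self.postings.get(feature, set())
        scored = []
        for doc_id in candidates:
            feats = self.features[doc_id]
            score = len(query & feats) / len(query | feats)
            scored.append((score, doc_id))
        # Highest score first; ties broken by document id for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(doc_id, score) for score, doc_id in scored[:top_n]]
```

Swapping `extract_features` for an image or audio fingerprinting function leaves the rest of the sketch unchanged, which is roughly what a custom tokenizer would achieve in Lucene.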

So, basically, Lucene might be a good choice, but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based search engine.

My question is: are there any applications or frameworks that might fit the problem described above?

Thanks, krisy

UPDATE: It seems that the process I described above is called content-based media (sound, image, video) retrieval.

There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but I still haven't found any dedicated framework ...


Solution

  • Since you're using Lucene, you might take a look at Solr. I realize it's not a dedicated framework for your purpose either, but it adds features on top of Lucene that come in quite handy. Given Lucene's pluggability, its track record, and the many useful resources available, Solr might help you get the job done.

    Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with Solr (but you have probably already read that in the meantime).
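As a rough illustration of what querying Solr for similar documents could look like, the sketch below builds a request URL that enables Solr's MoreLikeThis component through the `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf` and `mlt.count` parameters. The helper name `build_mlt_url`, the host, the core name, the `id` field, and the `fingerprint` field are all assumptions made for the example; they depend entirely on how the index is actually set up.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, core, doc_id, feature_field, count=10):
    """Build a select URL asking Solr for documents similar to `doc_id`.

    Assumes the unique key field is called "id" and that the extracted
    features or fingerprints are indexed in `feature_field`.
    """
    params = {
        "q": f"id:{doc_id}",      # the document to find neighbours for
        "mlt": "true",            # switch on the MoreLikeThis component
        "mlt.fl": feature_field,  # field(s) used to judge similarity
        "mlt.mintf": 1,           # minimum term frequency in the source document
        "mlt.mindf": 1,           # minimum document frequency across the index
        "mlt.count": count,       # number of similar documents to return
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names:
url = build_mlt_url("http://localhost:8983/solr", "documents", "42", "fingerprint")
```

Setting `mlt.mintf` and `mlt.mindf` to 1 is a guess that suits hashed features, which tend to occur only once per document; the defaults are tuned for natural-language text.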