Tags: search, lucene, full-text-search, search-engine, sphinx

Searching Techniques Recommendations


This is more of a theory question than a practical one. I'm working on a project which is quite a simple catalog of links. The whole model is similar to the Dmoz or Yahoo catalog, except that each entry has certain additional attributes.

I have a hierarchical taxonomy working on all entries with a many-to-many relationship. All entries are now sorted into these categories, and everything seems to work fine. Now, what use is a catalog if there's no search option?

Here's a little bit more detail about my models: Each entry has a title, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry could have a logo attached to it, and a hidden field for tags. Also, the title and description are stored in three different languages. So basically I'd like the search results to be:

  1. Relevant (including taxonomy)
  2. Possibly ones with logos
  3. Possibly ones with 100% filled out profiles

I've tried Sphinx and am currently working with Lucene, but it seems that I'm not getting the search right in theory. I hope it makes sense that filled-out entries should appear higher than the others, but I can't really figure out the scores. I wouldn't like irrelevant entries to appear on top just because a single word matches somewhere in the description, since titles are more relevant.

So my question is: are there any books, techniques or even other search engines (if Sphinx and Lucene are not good enough) that you would recommend for this? Not only would I like to get full control over search results and their ranking, but I'd also like to give my visitors correct and relevant information.

Links to cool articles are appreciated too!

And no, I'm not trying to rebuild Google :)

Thanks :)


Solution

  • I'm pretty sure that Lucene is enough. We solved a similar task with it, and it worked well. Here are some hints I can offer, looking back at my Lucene.Net project.

    Taxonomy:

    • Each category is represented by an integer key in the DB, so each document has multiple instances of a numeric 'CATEGORY' field. For example, document:[1,2,5,10, 'Wheel'] means that 'Wheel' belongs to each of the categories 1, 2, 5 and 10.
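
To make the multi-valued category field concrete, here is a minimal sketch in plain Python. It is not the Lucene.Net API: the `Doc` class and the `in_category` helper are hypothetical names used only for illustration. Each document carries a set of integer category keys, and filtering by category is a simple membership test.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: int
    title: str
    # Hypothetical stand-in for the repeated 'CATEGORY' fields of one document.
    categories: set = field(default_factory=set)

def in_category(docs, category_id):
    """Return the documents tagged with the given integer category key."""
    return [d for d in docs if category_id in d.categories]

docs = [
    Doc(1, "Wheel", {1, 2, 5, 10}),
    Doc(2, "Tire", {2, 7}),
]

# Both documents belong to category 2; only "Wheel" belongs to category 10.
print([d.title for d in in_category(docs, 2)])
print([d.title for d in in_category(docs, 10)])
```

In a real index the same idea would presumably be expressed as a filter on the category field combined with the text query.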

    Non-searchable fields (logos, social profile):

    • Of course, you can store non-searchable values in Lucene's non-indexed (stored-only) fields. But we kept all product-related information in the DB to avoid rebuilding the Lucene index whenever it changes. So Lucene holds only the product ID plus indexed but not stored values for the key fields.
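
The split between the index and the database can be sketched roughly as follows. This is a toy in plain Python, not Lucene: the `DB` dictionary, `build_index` and `search` are hypothetical names, and the "index" is a bare term-to-ID mapping over titles. The point is only that the index returns IDs and everything displayable is fetched from the database afterwards.

```python
from collections import defaultdict

# Hypothetical "database": full records, including non-searchable fields.
DB = {
    1: {"title": "Acme Wheels", "logo": "acme.png", "twitter": "@acme"},
    2: {"title": "Wheel Repair Shop", "logo": None, "twitter": None},
}

def build_index(db):
    """Map each lowercase title term to the set of record IDs containing it."""
    index = defaultdict(set)
    for record_id, record in db.items():
        for term in record["title"].lower().split():
            index[term].add(record_id)
    return index

def search(index, db, term):
    """Look up IDs in the index, then fetch the full records from the database."""
    ids = sorted(index.get(term.lower(), set()))
    return [db[i] for i in ids]

index = build_index(DB)

# Changing a non-searchable field touches only the database, not the index.
DB[2]["logo"] = "shop.png"
print(search(index, DB, "shop"))
```

With this layout, editing a logo or a social profile never requires reindexing; only changes to the searchable fields do.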

    Three languages and multiple fields:

    • We have only 2 languages, so the different titles of a product are stored in the same Lucene document and relate to a single product ID (as I wrote above, the ID refers to the DB). This lets you find a product even if the user's request mixes languages.
    • Obviously title, tags and description carry different weight in the search result. Lucene handles this by assigning a weight (boost) to each field.
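
As a rough illustration of field weights combined with the "logo" and "complete profile" preferences from the question, here is a toy scorer in plain Python. It is not how Lucene computes scores (its real formula is more involved), and the weights, bonus values and the `score` function are assumptions made up for this sketch. Titles in several languages are kept in one entry, so a query in any of them can match.

```python
# Assumed, illustrative weights; tune them against real queries.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "description": 1.0}
LOGO_BONUS = 0.5
COMPLETE_PROFILE_BONUS = 0.5

def score(entry, query):
    """Sum field weights for every query term found in each field, then add
    small bonuses for a logo and a fully filled-out profile."""
    terms = query.lower().split()
    text_score = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        # All language variants of a field are joined into one bag of words.
        words = " ".join(entry.get(field_name, [])).lower().split()
        text_score += weight * sum(1 for t in terms if t in words)
    if text_score == 0:
        return 0.0  # bonuses never promote an entry that does not match
    bonus = LOGO_BONUS if entry.get("logo") else 0.0
    profiles = entry.get("profiles", {})
    if profiles and all(profiles.values()):
        bonus += COMPLETE_PROFILE_BONUS
    return text_score + bonus

entries = [
    {"id": 1, "title": ["Bike Shop", "Magasin de vélos"], "tags": ["bike"],
     "description": ["We sell parts."], "logo": "a.png",
     "profiles": {"twitter": "@bikeshop", "youtube": "bikeshop"}},
    {"id": 2, "title": ["City Guide"], "tags": ["travel"],
     "description": ["Where to rent a bike in town."], "logo": None,
     "profiles": {"twitter": "", "youtube": ""}},
]

# Highest score first: the title/tag match outranks the description-only match.
ranked = sorted(entries, key=lambda e: score(e, "bike"), reverse=True)
print([e["id"] for e in ranked])
```

Keeping the bonuses small relative to the field weights means a complete profile can break ties between similar matches but cannot lift a weak description-only match above a title match.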