php search-engine phalcon

How do Lucene/Sphinx/Solr work?


I have a website built with Phalcon and I'm trying to add a search engine to it. The content, however, is not in a database; it lives in flat files located in app/views/.

I've never implemented a search engine, but from what I gather it seems like Lucene or Solr/Sphinx is what I need.

Do these tools offer the option to crawl my website à la HTTrack, creating the index and the necessary absolute URI hyperlinks along the way?

How do I go about specifying which portions of the HTML files should be parsed? How do these tools handle ignoring certain areas (e.g. HTML, JS)?


Solution

  • Lucene is first and foremost an index. It's not even a database; it's just the index portion of a database, if you will. It's highly configurable in what it indexes and how, and in which data is retained in its original form and which can be discarded once it has been indexed. You create a schema first, just as you would create a database schema. In Lucene's case, however, that schema defines which tokenisers and filters are used to build the index for each of your fields. You then feed your documents into it to populate the index. How you do that is up to you: there are several different APIs for feeding data in, but a "web crawler" is not one of them, so it won't go out and find your data automatically. Once the index is populated, you can query it in various ways to retrieve the documents you fed in earlier. That's it in a nutshell.

    Lucene is pretty much exclusively the index engine: its job is tokenising and transforming text and other data into an index that can be queried quickly. It's the part that lets you query for "manufacturer of widgets" and get back a document containing the text "widget manufacturers", provided you have tuned your indexing and querying accordingly. Solr is a server wrapped around Lucene that adds an HTTP-based API and some other niceties. Both are still fairly low-level tools that you can use to build a search engine. Neither is an out-of-the-box "search engine" like Google by any means.
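
To make the "feed documents in, analyse them, query the index" flow concrete, here is a minimal toy sketch in Python. It is not Lucene's or Solr's API. Every name in it (`TextExtractor`, `html_to_text`, `analyse`, `InvertedIndex`, the sample `pages`) is made up for illustration, and the "stemmer" is just a crude plural stripper. It only shows the idea: you extract the text you care about yourself (here, skipping `<script>` and `<style>` contents), run it through a tokeniser and filters, and run the query through the same analysis chain so that "manufacturer of widgets" can match "widget manufacturers".

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: a tiny hand-picked stopword list, purely for illustration.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """The 'feeding' step: you decide which parts of a page become text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def analyse(text):
    """Tokenise, lowercase, drop stopwords, strip a trailing plural 's'."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        terms.append(token)
    return terms


class InvertedIndex:
    """Maps each analysed term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in analyse(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # The query goes through the same analysis as the documents (AND match).
        terms = analyse(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical pages standing in for the rendered files from app/views/.
pages = {
    "/widgets": "<html><head><script>var tracking = 'manufacturer';</script>"
                "</head><body><h1>Widget manufacturers</h1></body></html>",
    "/about": "<html><body><p>About the gadget team</p></body></html>",
}

index = InvertedIndex()
for url, html in pages.items():
    index.add(url, html_to_text(html))

print(index.search("manufacturer of widgets"))  # {'/widgets'}
print(index.search("tracking"))                 # set(): script content was never indexed
```

A real setup would swap the toy `analyse` function for Lucene's or Solr's configurable analysers and would keep the index on disk, but the division of labour stays the same: your own code decides what text goes in, and the index engine decides how it is tokenised and matched.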