I'm working on a system that matches large sets of records on strings, numeric ranges, and date ranges. The string matches are mostly exact, as far as I can tell, as opposed to the less exact full-text-search results that I understand Lucene is generally designed for. Numeric precision is important because the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but that isn't something it was originally designed for.
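To make the precision concern concrete, here is a minimal sketch in plain Python (not Lucene) of one common way to keep price ranges exact: store each price as an integer number of cents and run the range lookup over the sorted integers. The names `to_cents`, `build_price_index`, and `price_range` are made up for illustration, and the two-decimal-place assumption is mine.

```python
from bisect import bisect_left, bisect_right
from decimal import Decimal


def to_cents(price: str) -> int:
    """Convert a decimal price string to an exact integer number of cents."""
    cents = Decimal(price) * 100
    # Assumption: prices never carry more than two decimal places.
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {price}")
    return int(cents)


def build_price_index(prices):
    """Return (cents, record_id) pairs sorted by price, given {record_id: price string}."""
    return sorted((to_cents(p), rid) for rid, p in prices.items())


def price_range(index, low, high):
    """Return record IDs whose price lies in the inclusive range [low, high]."""
    keys = [cents for cents, _ in index]
    start = bisect_left(keys, to_cents(low))
    end = bisect_right(keys, to_cents(high))
    return [rid for _, rid in index[start:end]]
```

Because the comparison happens on integers, a boundary such as 20.00 is matched exactly, with none of the rounding surprises that binary floating point can introduce.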
Currently the system uses procedural SQL to do the matching, and it is reaching the limits of its scalability. I'm researching ways to scale the system horizontally, and search engine technology seems like a possibility, given that some of these technologies scale to very large data sets while still returning results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching against the metadata Lucene generates, without hitting the database for the full records until the matching rules have determined what should be retrieved. Eventually I would like to aim for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven’t been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add (x = y OR (x = z AND y = x)...), the better Lucene looks by comparison.
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing, etc.
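As a rough illustration of why combined clauses tend to stay cheap in an inverted index, here is a toy sketch in plain Python. It is not how Lucene is actually implemented; it only shows the general idea that each field/value term maps to a set of record IDs, so AND and OR become set intersections and unions, and the full records are fetched only for the IDs that survive. The field names, sample records, and helper names (`build_index`, `term`, `match_ids`) are all made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data standing in for rows that live in the database.
RECORDS = {
    1: {"brand": "acme", "colour": "blue"},
    2: {"brand": "zenith", "colour": "red"},
    3: {"brand": "zenith", "colour": "blue"},
    4: {"brand": "other", "colour": "red"},
}


def build_index(records):
    """Map each (field, value) pair to the set of record IDs that contain it."""
    index = defaultdict(set)
    for rid, fields in records.items():
        for field, value in fields.items():
            index[(field, value)].add(rid)
    return index


def term(index, field, value):
    """Return the set of record IDs for an exact field/value match."""
    return index.get((field, value), set())


def match_ids(index):
    """brand = acme OR (brand = zenith AND colour = red), using set operations only."""
    return term(index, "brand", "acme") | (
        term(index, "brand", "zenith") & term(index, "colour", "red")
    )


index = build_index(RECORDS)
ids = match_ids(index)
# Only now touch the "database", and only for the matching IDs.
full_records = [RECORDS[rid] for rid in sorted(ids)]
```

Every clause here is answered from the index alone, which matches the pattern described in the question: decide what matches first, then retrieve the full records.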