I have a fairly simple database that I'd like to index with Lucene. The domain classes are:
// Person domain
class Person {
    Set<Pair> keys;
}

// Pair domain
class Pair {
    KeyItem keyItem;
    String value;
}

// KeyItem domain; name is a unique field within the DB (!!)
class KeyItem {
    String name;
}
I have tens of millions of profiles and hundreds of millions of Pairs. However, since most KeyItem "name" values are duplicates, there are only a few dozen KeyItem instances. I came up with this structure to save on KeyItem instances.
Any Profile with any set of fields can be saved into that structure. Let's say we have a profile with these properties:
- name: Andrew Morton
- education: University of New South Wales
- country: Australia
- occupation: Linux programmer
To store it, we'll have a single Profile instance; 4 KeyItem instances (name, education, country and occupation); and 4 Pair instances with the values "Andrew Morton", "University of New South Wales", "Australia" and "Linux programmer".
All other profiles will reference (all or some of) the same KeyItem instances: name, education, country and occupation.
My question is how to index all of that so I can search for Profiles by particular values of KeyItem::name and Pair::value. Ideally I'd like this kind of query to work:
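To make the sharing concrete, here is a minimal sketch of how a single KeyItem per unique name could be handed out to every profile. KeyItemRegistry, intern and profile are hypothetical names introduced only for this illustration; they are not part of my actual code.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical registry: hands out one shared KeyItem per unique name.
class KeyItemRegistry {
    static final class KeyItem {
        final String name;
        KeyItem(String name) { this.name = name; }
    }

    static final class Pair {
        final KeyItem keyItem;
        final String value;
        Pair(KeyItem keyItem, String value) {
            this.keyItem = keyItem;
            this.value = value;
        }
    }

    private final Map<String, KeyItem> byName = new HashMap<>();

    // Returns the existing KeyItem for this name, creating it on first use.
    KeyItem intern(String name) {
        return byName.computeIfAbsent(name, KeyItem::new);
    }

    // Number of distinct KeyItem instances created so far.
    int size() {
        return byName.size();
    }

    // Builds the Pair set for one profile from alternating key/value arguments.
    Set<Pair> profile(String... keysAndValues) {
        Set<Pair> pairs = new LinkedHashSet<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            pairs.add(new Pair(intern(keysAndValues[i]), keysAndValues[i + 1]));
        }
        return pairs;
    }
}
```

With this, the number of KeyItem instances stays at the number of distinct names no matter how many profiles are built through the same registry.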
name:Andrew* AND occupation:Linux*
Should I create a custom Indexer and Searcher? Or can I use the standard ones and just map KeyItem and Pair onto Lucene components somehow?
I believe you can use standard Lucene methodology. I would:
If you choose bare Lucene, you will need a custom Indexer and Searcher, but they are not hard to build. Solr may be easier, since it needs less programming. I have not verified whether Solr handles an open-ended schema like the one I described; field names are normally declared in its schema, although its dynamic fields (wildcard patterns such as *_s) may cover this case.
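As a rough illustration of what a custom Indexer would have to do, the sketch below flattens one Person into a map from field name (the KeyItem name) to values (the Pair values), then checks a query such as name:Andrew* AND occupation:Linux* against it. This is plain Java that only emulates the idea; it does not call Lucene. PersonFlattener, toFields and matchesPrefix are hypothetical names, and the prefix check is a simplified stand-in for what an analyzer and a wildcard query would do. In real Lucene code, each map entry would presumably become a field (for example a TextField) on one Document per Person; that mapping is my assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch only: plain-Java stand-in for "one document per Person".
class PersonFlattener {
    static final class KeyItem {
        final String name;
        KeyItem(String name) { this.name = name; }
    }

    static final class Pair {
        final KeyItem keyItem;
        final String value;
        Pair(KeyItem keyItem, String value) {
            this.keyItem = keyItem;
            this.value = value;
        }
    }

    static final class Person {
        final Set<Pair> keys = new LinkedHashSet<>();
    }

    // Field name -> values. Each entry stands for one field of the
    // Person's document (assumed mapping, not a Lucene API call).
    static Map<String, List<String>> toFields(Person person) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Pair pair : person.keys) {
            fields.computeIfAbsent(pair.keyItem.name, k -> new ArrayList<>())
                  .add(pair.value);
        }
        return fields;
    }

    // Emulates one "field:prefix*" clause: true if any whitespace-separated
    // token of any value in the field starts with the prefix, ignoring case.
    static boolean matchesPrefix(Map<String, List<String>> fields,
                                 String field, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        List<String> values = fields.getOrDefault(field, Collections.emptyList());
        for (String value : values) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(wanted)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Person p = new Person();
        p.keys.add(new Pair(new KeyItem("name"), "Andrew Morton"));
        p.keys.add(new Pair(new KeyItem("occupation"), "Linux programmer"));
        Map<String, List<String>> fields = toFields(p);
        // Equivalent of: name:Andrew* AND occupation:Linux*
        boolean hit = matchesPrefix(fields, "name", "Andrew")
                && matchesPrefix(fields, "occupation", "Linux");
        System.out.println(hit);
    }
}
```

Because the field names come from the data rather than from a fixed list, nothing here needs to know the set of KeyItem names in advance, which is the open-ended part of the schema.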