solr lucene search-engine

Solr indexing approach


I need to build a multilingual index covering two languages, Hindi and English. Their scripts are completely different, so their stemmers and lemmatisers don't affect each other. The index will be huge, containing millions of documents. Which of the following three approaches should I use for indexing?

  1. Single field for both languages: a) since the scripts are different, I can apply both analysers to it; b) faster searching, because the number of fields is limited; c) I will need to take care of relevancy issues.

  2. Language-specific fields: a) possibly slower searching because of the many fields.

  3. Multicore approach: a) problems handling multilingual documents; b) administration will be hard; c) language-specific search will be easy.


Solution

  • I suggest separate cores. IMHO, it's simply the right way to go.

    You don't have to use Solr's automatic language recognition, since you define the analyzers (lemmatizers/stemmers) for each core/language separately. The only drawback is duplicated boilerplate configuration, since most settings are the same for both cores.

    See this recent, similar post:

    Applying Language Specific Analyzer Dynamically before Solr Indexing
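
Since Hindi and English use different scripts, the client can decide which core a document belongs to before sending it to Solr. Below is a minimal sketch of that routing step, not a definitive implementation. The core names `docs_hi` and `docs_en`, the `text` key, and the helper functions are all hypothetical, and the script check simply counts characters in the Devanagari Unicode block (U+0900 to U+097F) against ASCII letters.

```python
# Hypothetical core names; replace them with the cores defined in your Solr setup.
CORE_BY_SCRIPT = {"devanagari": "docs_hi", "latin": "docs_en"}


def dominant_script(text):
    """Return 'devanagari' if Devanagari characters outnumber ASCII letters, else 'latin'."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "devanagari" if devanagari > latin else "latin"


def route(docs):
    """Group documents (dicts with a 'text' key, an assumed shape) by target core name."""
    batches = {core: [] for core in CORE_BY_SCRIPT.values()}
    for doc in docs:
        core = CORE_BY_SCRIPT[dominant_script(doc["text"])]
        batches[core].append(doc)
    return batches


docs = [
    {"id": "1", "text": "Solr indexing approach"},
    {"id": "2", "text": "सोलर इंडेक्सिंग"},
]
batches = route(docs)
```

Each batch would then be posted to the update handler of its own core. Note that a document mixing both languages goes entirely to the core of its dominant script under this sketch, which is the multilingual-document problem raised in the question; such documents may need to be split or indexed into both cores.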