I'm trying to use LucidWorks (http://www.lucidimagination.com/products/lucidworks-search-platform) as a search engine for my organization intranet. I want it to index various document-types (Office formats, PDFs, web pages) from various data sources (web & wiki, file system, Subversion repositories). So far I tried indexing several sites, directories & repositories (about 500K documents, with total size of about 50GB) - and the size of the index is 155GB.
Is this reasonable? Should the index occupy more storage than the data itself? What would be a reasonable rule of thumb for the data-size to index-size ratio?
There is no single "reasonable" index size; it basically depends on the data you have. Ideally the index should be smaller than the source data, but there is no fixed rule of thumb. The ratio of index size to data size depends on how you index the data, and many factors affect it.
Most of the space in the index is consumed by stored fields. If you are indexing the content of your documents and all of that content is stored, the index will certainly grow huge.
Fine-tuning the attributes of your indexed fields also helps save space. You may want to revisit which fields really need to be indexed and which need to be stored. Also check whether you are using lots of copyFields that duplicate data, or are otherwise maintaining repetitive data.
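As a sketch of what that tuning looks like (the field names here are hypothetical, not taken from your setup), in Solr's schema.xml you control this with the `indexed` and `stored` attributes on each field, and with `copyField` directives:

```xml
<!-- Hypothetical schema.xml fragment: field and copyField tuning -->
<fields>
  <!-- Full document body: searchable, but NOT stored, so the raw
       extracted text does not bloat the index. The trade-off is that
       highlighting/snippets cannot be generated from this field. -->
  <field name="content" type="text" indexed="true" stored="false"/>

  <!-- Title: small, so storing it for display in results is cheap. -->
  <field name="title" type="text" indexed="true" stored="true"/>

  <!-- URL: needed only for display, never searched directly. -->
  <field name="url" type="string" indexed="false" stored="true"/>
</fields>

<!-- Each copyField duplicates data into its destination field
     (here a catch-all "text" field, assumed to exist elsewhere in
     the schema); every one you can drop saves space. -->
<copyField source="title" dest="text"/>
```

For 50GB of source documents, whether the extracted text is stored once, stored again in copyField targets, or not stored at all can easily account for a multi-x difference in index size.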
Optimizing the index might help as well, since it merges segments and reclaims the space held by deleted documents.
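Since LucidWorks is built on Solr, an optimize can be triggered through Solr's XML update handler. A minimal sketch, assuming a default install at localhost:8983 (adjust host, port, and path for your installation):

```shell
# Post an <optimize/> command to the update handler and commit.
# Note: optimizing temporarily needs extra disk space while
# segments are being merged.
curl http://localhost:8983/solr/update \
     -H 'Content-Type: text/xml' \
     --data-binary '<optimize/>'
```

This requires a running Solr instance, so run it against your own server rather than as-is.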
More info @ http://wiki.apache.org/solr/SolrPerformanceFactors