I have a Solr index with approximately 20 million items. When these items are indexed, they are added to the index in batches.
Approximately 5% of them end up indexed twice or more, causing a duplicates problem.
If I check the log, I can see that these items are indeed added twice (or more), often 2-3 minutes apart, with other items indexed in between.
The web server that triggers the indexing is in a load-balanced environment (2 web servers). However, the indexing itself is done by a single web server.
Here are some of the config elements in solrconfig.xml:
<indexDefaults>
  .....
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <maxFieldLength>10000</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <double name="maxMergeMB">1024.0</double>
  </mergePolicy>
</indexDefaults>

<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  .....
</mainIndex>
I'm using Solr 1.4.1 and Tomcat 7.0.16, along with the latest SolrNet library.
What might cause this duplicates problem? Thanks for any input!
To answer your question completely I would need to see your schema. The schema defines a uniqueKey field that works like a primary key in a database. Make sure the unique identifier of the document is declared as the uniqueKey; then a re-indexed document will overwrite the existing one instead of creating a duplicate.
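For illustration, here is a minimal sketch of the relevant part of schema.xml, assuming the identifier field is called "id" (the field name is an assumption; use whatever uniquely identifies your documents):

  <fields>
    <!-- the unique identifier of each document; the name "id" is assumed -->
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <!-- ... your other fields ... -->
  </fields>

  <!-- declare the identifier as the unique key, so re-adding a document
       with the same id replaces the old one instead of duplicating it -->
  <uniqueKey>id</uniqueKey>

In Solr 1.4, adding a document whose uniqueKey matches an existing document deletes the old copy and replaces it, so duplicates cannot accumulate.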
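On the SolrNet side, the same field should be mapped with the SolrUniqueKey attribute so every batch carries the identifier. A minimal sketch, assuming an Item document class (the class, property names, and URL are illustrative, not from your code):

  using System.Collections.Generic;
  using Microsoft.Practices.ServiceLocation;
  using SolrNet;
  using SolrNet.Attributes;

  public class Item
  {
      // maps to the <uniqueKey> field in schema.xml; "id" is an assumed name
      [SolrUniqueKey("id")]
      public string Id { get; set; }

      [SolrField("name")]
      public string Name { get; set; }
  }

  public class Indexer
  {
      public void IndexBatch(IEnumerable<Item> batch)
      {
          // Startup.Init<Item>("http://localhost:8983/solr") is assumed to
          // have been called once at application start.
          var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Item>>();

          foreach (var item in batch)
              solr.Add(item); // same id => the existing document is overwritten

          solr.Commit();
      }
  }

If documents still show up twice after this, compare the id values of the duplicate log entries: two adds with different ids are two distinct documents as far as Solr is concerned.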