Reasons why Lucene indexes get corrupted [Alfresco 4.2]

I am running Alfresco 4.2 on a Red Hat 7 server, so I have to deal with Lucene 2.4. The issue I am dealing with is that the Lucene indexes are being corrupted more and more often. Every time that happens, the repository goes down, and only a full reindex brings the server back up.

I need help understanding what is causing the index corruption and how to deal with it (the reindexing takes a lot of time).

Solution

Let me mention before I start in earnest: Alfresco uses Solr, which in turn uses Lucene for indexing, so I wouldn't manage the Lucene indexes directly on Alfresco. Instead, manage your indexes via the Solr tooling Alfresco provides.

I, too, have found that the Lucene/Solr index tends to "drift" in this version of Alfresco (4.2.0). Having engaged Alfresco support on this many times, we've found no solid root cause; they say it may be attributed to "certain customizations" we've made, but they haven't been more specific than that.

So while we've not found a solution, there are proactive steps we take to mitigate the issue.

There is a Solr report we check daily (https://your-alfresco-server.com:8443/solr/report/). On this report, there is a value labeled "Count of transactions in the index but not the DB" (which is a very misleading label, in my experience). The higher this value, the more out-of-sync our index seems to be, so as it climbs we schedule a re-index for a time when no one will be impacted.
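
If you want to automate that daily check, something along these lines might work. It is only a sketch: it assumes the report comes back as Solr-style XML in which each value is an element carrying a `name` attribute, and the sample document, the function names, and the threshold of 10 are all made up for illustration. Fetching the report (and whatever authentication your server requires) is left out on purpose.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Label taken from the report described above.
LABEL = "Count of transactions in the index but not the DB"

# Hypothetical sample; the real report layout may differ.
SAMPLE_REPORT = (
    '<response><lst name="Summary"><lst name="alfresco">'
    '<long name="Count of transactions in the index but not the DB">12</long>'
    '</lst></lst></response>'
)


def read_report_value(report_xml: str, label: str = LABEL) -> Optional[int]:
    """Return the integer value of the first element whose name attribute equals label."""
    root = ET.fromstring(report_xml)
    for element in root.iter():
        if element.get("name") == label and element.text is not None:
            return int(element.text.strip())
    return None  # label not present in this report


def needs_reindex(value: Optional[int], threshold: int) -> bool:
    """True when the value was found and has reached the (arbitrary) threshold."""
    return value is not None and value >= threshold


if __name__ == "__main__":
    value = read_report_value(SAMPLE_REPORT)
    print(value, needs_reindex(value, threshold=10))
```

The threshold is a judgment call; we simply watch the trend and pick a number that gives us enough lead time to schedule the re-index.
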
There are also services the Alfresco server exposes to fix and reindex Solr. (Full disclosure: I have not found them to be very effective, but they come recommended by Alfresco Support.)

Solr re-index service: http://your-alfresco-server.com:8080/solr/admin/cores?action=REINDEX&txid=

Solr "Fix" service: http://your-alfresco-server.com:8080/solr/admin/cores?action=FIX
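
If you end up calling these two services from a script, a small helper to build the URLs keeps the query strings consistent. This is a minimal sketch: `solr_admin_url` is a hypothetical helper name, the host is the placeholder used above, and the transaction ID in the usage line is an invented value that you would replace with the one you actually want to re-index. It only builds the URL; sending the request is left out because the connection and authentication setup varies by installation.

```python
from urllib.parse import urlencode


def solr_admin_url(base_url: str, action: str, **params) -> str:
    """Build a /solr/admin/cores URL for the given action and extra query parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url.rstrip('/')}/solr/admin/cores?{query}"


if __name__ == "__main__":
    base = "http://your-alfresco-server.com:8080"  # placeholder host from above
    print(solr_admin_url(base, "FIX"))
    # 12345 is a made-up transaction ID for illustration only.
    print(solr_admin_url(base, "REINDEX", txid=12345))
```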

Purging stale content can reduce the time to re-index (this includes transfer reports, etc., that Alfresco generates, which tend to accumulate but aren't, in my case at least, important).

Unfortunately, the true solution often comes down to re-indexing on a scheduled, rotating basis to minimize downtime.