The company I work for has millions of documents stored and shared on multiple network drives that are mapped to users' local drive letters (e.g. D:\ mapped to \\server1\, etc.).
What I'd like to implement is a crawler over those network drives so that users can find files quickly through full-text indexing.
My current indexing strategy is based on Lucene.Net, but I am not sure how often I should re-index the network drives, because there are millions of documents to index, not to mention all the packets travelling over the network on each crawl.
So the question is: how should I decide on an indexing frequency?
I've researched how often Google Desktop and Windows Desktop Search re-index as a point of reference, but that has been fruitless.
A lot of the answer is wrapped up in whatever service level agreements you have with your customers. If your SLA states that search results must be current to within X minutes, then that answers the question of how frequently you should index.
If you, like me, do not have concrete SLAs in place for searching and indexing, then you can be more flexible. For example, I manage, among other things, a SharePoint Search server for my business. In addition to our web site, we also index a lot of content in unstructured file space. The server supports full and incremental crawls. We timed several incremental crawls to get an estimate of how long one takes to complete, then scheduled our incremental crawls on an interval comfortably larger than the observed elapsed time. We scheduled full crawls to occur less frequently, at non-peak times.
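To make the incremental part concrete, here is a minimal sketch of that pattern using Lucene's Java API (Lucene.Net mirrors these classes closely). The share path, index location, and field names are assumptions for illustration; the key idea is to key each document on its file path and skip anything not modified since the previous pass.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.time.Instant;

    public class IncrementalCrawler {
        // Assumed locations -- replace with your real share and index paths.
        private final Path share = Paths.get("\\\\server1\\docs");
        private final Path indexPath = Paths.get("D:\\search-index");
        private Instant lastCrawl = Instant.EPOCH; // persist this between runs in practice

        public void crawlOnce() throws IOException {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath), cfg)) {
                Instant started = Instant.now();
                Files.walkFileTree(share, new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                        // Only files modified since the last pass are read, so most of the
                        // millions of documents never travel over the network.
                        if (attrs.lastModifiedTime().toInstant().isAfter(lastCrawl)) {
                            Document doc = new Document();
                            doc.add(new StringField("path", file.toString(), Field.Store.YES));
                            doc.add(new TextField("contents", new String(Files.readAllBytes(file)), Field.Store.NO));
                            // updateDocument replaces any existing entry for this path,
                            // so repeated crawls never produce duplicates.
                            writer.updateDocument(new Term("path", file.toString()), doc);
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
                lastCrawl = started;
            }
        }
    }

In Lucene.Net the same classes exist under the Lucene.Net namespaces, and a real crawler would extract text from Office/PDF formats (e.g. via IFilter or Tika) rather than reading raw bytes as above.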
The specifics will vary with the indexing technology you use, but the principle is the same: measure how long a crawl actually takes, schedule incremental crawls on an interval comfortably larger than that measurement, and run full crawls less often, during off-peak hours.
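As a rough sketch of that scheduling idea (the 30-minute figure below is purely an assumed placeholder, not a recommendation -- pick an interval larger than the crawl time you actually measure):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class CrawlScheduler {
        public static void main(String[] args) {
            IncrementalCrawler crawler = new IncrementalCrawler(); // sketch from above
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            // scheduleWithFixedDelay waits for each pass to finish before the next delay
            // starts, so a slow crawl can never overlap with the following one.
            scheduler.scheduleWithFixedDelay(() -> {
                try {
                    crawler.crawlOnce();
                } catch (Exception e) {
                    e.printStackTrace(); // log it and let the next run retry
                }
            }, 0, 30, TimeUnit.MINUTES);
        }
    }

Full crawls (to pick up deletions and anything an incremental pass misses) can then be a second, much less frequent job pinned to off-peak hours, for example via Windows Task Scheduler.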
Good luck!