I'm planning to deploy a Google Search Appliance to index an intranet site remotely (across continents), so the crawl will run over the company's network and could consume too much bandwidth. To mitigate the effect of the initial crawl (the only one perceived as a risk to the network), the configuration options I have found are:
Am I on the right track, and are there other settings I can configure to avoid generating excessive network traffic between the GSA and the web servers?
The best way to minimize the crawling of a remote site is to not crawl it. Failing that, there are a couple of settings that will help, as noted above:
1) Host Load Schedule
This sets the number of concurrent threads the crawler may use for the host. Note that this can be a fractional value (e.g. 2.5), including a number below 1 (e.g. 0.5). (Also noted by BigMikeW.)
2) Freshness Tuning
"Crawl infrequently" actually means "crawl never again". This works well in conjunction with a meta-URL feed, which tells the GSA to recrawl a page, or with a recrawl request from the administrative console. "Crawl frequently" actually means "crawl once per day". This setting doesn't mean much now that the crawler has been retuned and the hardware is faster: the GSA will submit requests throughout the day to the pages it finds.
3) Crawl schedule
I find that it's better not to turn off the crawler, but rather to keep it in continuous mode and set the threshold to zero. This allows the natural GSA algorithms to play out. Anything you wish to achieve by scheduling can be achieved by tuning the threshold to zero for the periods when you want the crawler quiet.
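To get a feel for what a given host load means for the WAN link, a back-of-envelope estimate can help. This is only a rough sketch: it assumes the host load behaves like an average number of concurrent fetches, and the function name, average document size, and average fetch time are all hypothetical inputs you would have to measure for your own site.

```python
def estimate_crawl(host_load, avg_doc_bytes, avg_fetch_seconds, total_docs):
    """Rough estimate of WAN usage and duration for an initial crawl.

    Assumption (not from the GSA docs): host_load acts as the average
    number of fetches in flight, so throughput scales linearly with it.
    """
    # Documents completed per second across all concurrent fetches.
    docs_per_second = host_load / avg_fetch_seconds
    # Average payload crossing the WAN per second, in bytes.
    bytes_per_second = docs_per_second * avg_doc_bytes
    # Convert to megabits per second (1 Mbit = 1,000,000 bits).
    mbit_per_second = bytes_per_second * 8 / 1_000_000
    # Time needed to fetch every document once, in hours.
    hours = total_docs / docs_per_second / 3600
    return mbit_per_second, hours


if __name__ == "__main__":
    mbit, hours = estimate_crawl(
        host_load=0.5,          # hypothetical host load setting
        avg_doc_bytes=100_000,  # hypothetical average page size
        avg_fetch_seconds=1.0,  # hypothetical average fetch time
        total_docs=180_000,     # hypothetical size of the intranet site
    )
    print(f"~{mbit:.2f} Mbit/s for ~{hours:.0f} hours")
```

Lowering the host load trades bandwidth for time: halving it roughly halves the average traffic and doubles the duration of the initial crawl.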
My recommendation for minimizing WAN traffic:
1) Review DNS and add an override if necessary to ensure you are routing to the nearest content source
2) Set the content source's pattern to crawl infrequently
3) Create a meta-URL feed to push content updates
The last one would take a bit of coding. There is an example sitemap feeder here: https://code.google.com/p/gsafeedmanager/
With this configuration, the GSA will never recrawl the content and will rely on the feed to inform it of updates.
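For illustration, here is a minimal sketch of the XML such a feeder could generate for a `metadata-and-url` feed. The element names follow the GSA feeds format as I remember it (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`), so check them against the Feeds Protocol guide for your GSA version. The function name, the datasource name, and the URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# DOCTYPE as used in GSA feed examples; verify against your GSA version.
FEED_DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_metadata_url_feed(datasource, urls):
    """Return feed XML asking the GSA to (re)crawl the given URLs."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        # One record per changed page; ElementTree escapes '&' in the URL.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + FEED_DOCTYPE + "\n" + body


if __name__ == "__main__":
    # Hypothetical datasource name and intranet URLs.
    changed = [
        "http://intranet.example.com/news/today.html",
        "http://intranet.example.com/page?id=1&lang=en",
    ]
    print(build_metadata_url_feed("intranet_updates", changed))
```

The resulting XML is then pushed to the appliance's feed endpoint (port 19900, path `/xmlfeed`, if I recall correctly) as a multipart form post. Only the pages listed in the feed are fetched again, so the WAN traffic stays proportional to what actually changed.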
Alternative: ensure the content source responds to HEAD requests with Last-Modified dates, and do not configure crawl infrequently. The GSA will detect deltas and slow the crawl down over time.
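Before relying on this, it is worth checking that the web servers really do return a usable `Last-Modified` header on HEAD requests. A quick sketch using only the Python standard library follows; the function names and the URL are placeholders of mine, not anything GSA-specific.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def last_modified_from_headers(headers):
    """Parse the Last-Modified header; return a datetime or None."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Header present but not a valid HTTP date.
        return None


def head_last_modified(url, timeout=10):
    """Send a HEAD request and return the parsed Last-Modified date, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return last_modified_from_headers(response.headers)


if __name__ == "__main__":
    # Placeholder URL; substitute a page from the content source.
    url = "http://intranet.example.com/index.html"
    try:
        print(url, "->", head_last_modified(url))
    except OSError as exc:
        print("HEAD request failed:", exc)
```

If this prints `None` for most pages (common with dynamically generated content), the crawler has no date to compare against, and the feed-based approach is probably the safer choice.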