network-programming network-traffic google-search-appliance

Optimizing Google Search Appliance on a remote server


I'm planning to deploy a Google Search Appliance (GSA) to index an intranet site remotely, across continents. The crawl will run over the company's network and could consume too much bandwidth. The initial crawl is the only one perceived as dangerous for the network, and these are the settings I can use to mitigate its effect:

  • Crawl and Index > Host Load Schedule
    • Web Server Host Load: basically the number of concurrent connections to the crawled servers within 1 minute, so minimizing this setting should reduce the traffic generated by the crawl.
    • Exceptions to Web Server Host Load: this is a schedule used for either increasing or decreasing the number of concurrent connections to the crawled server.
  • Crawl and Index > Crawl Schedule
    • Instead of a continuous crawl I should choose a scheduled crawl.

Am I on the right track and can other settings be configured in order not to generate excessive network traffic between the GSA and the Web servers?


Solution

  • The best way to minimize the crawling of a remote site is to not crawl it. Failing that, a few settings will help, as noted above:

    1) Host Load Schedule

    This sets the number of concurrent threads the crawler assigns to the host. Note that this can be a fractional number, including a value below 1 (e.g. 0.25) (also noted by BigMikeW).

    2) Freshness Tuning

    "Crawl infrequently" actually means "crawl never again". This works well in conjunction with a meta-URL feed that tells the GSA to recrawl a page, or with a recrawl request from the administrative console. "Crawl frequently" actually means "crawl once per day". This setting doesn't mean much now that the crawler has been retuned and the hardware is faster: the GSA will request the pages it finds several times a day anyway.

    3) Crawl schedule

    I find that it's better not to turn off the crawler, but rather to keep it in continuous mode and set the host load to zero. This allows the natural GSA algorithms to play out. Anything you wish to achieve by scheduling can be achieved by tuning the host load to zero for the periods when you want the crawler quiet.
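As a planning aid for the approach above, here is a minimal sketch of a host load schedule with a quiet window. It is only a model for working out which values to enter in the admin console, not a GSA API; the function name, the tuple layout and the example hours and loads are all assumptions.

```python
from typing import List, Tuple

# One schedule exception: (start_hour, end_hour, load).
# Hours are 0-23 and end_hour is exclusive. This layout is hypothetical.
LoadWindow = Tuple[int, int, float]


def host_load_at(hour: int, default_load: float, windows: List[LoadWindow]) -> float:
    """Return the host load that applies at the given hour (0-23)."""
    for start, end, load in windows:
        if start <= end:
            in_window = start <= hour < end
        else:
            # The window wraps past midnight, e.g. 22 -> 6.
            in_window = hour >= start or hour < end
        if in_window:
            return load
    return default_load


# Hypothetical plan: crawler quiet (load 0) from 08:00 to 18:00,
# fractional load of 0.5 for the rest of the day.
plan = [(8, 18, 0.0)]
hourly = [host_load_at(h, 0.5, plan) for h in range(24)]
```

The first matching window wins, so overlapping windows should be listed in priority order.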

    My recommendation for minimizing WAN traffic:

    1) Review DNS and add an override if necessary to ensure you are routing to the nearest content source.
    2) Set the content source's URL pattern to crawl infrequently.
    3) Create a meta-URL feed to push content updates.

    The last one would take a bit of coding. There is an example sitemap feeder here: https://code.google.com/p/gsafeedmanager/

    With this configuration, the GSA will never recrawl the content on its own and will rely on the feed to inform it of updates.
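To give an idea of the coding involved, here is a minimal sketch that builds the XML for a meta-URL feed listing the pages that changed. The element and attribute names follow my recollection of the GSA Feeds Protocol (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`) and should be checked against the documentation for your appliance version. The function name, the datasource name and the URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# DOCTYPE line as recalled from the Feeds Protocol; verify before use.
DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_meta_url_feed(datasource: str, urls: Iterable[str]) -> str:
    """Build a metadata-and-url feed asking the GSA to (re)crawl the given URLs."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url in urls:
        # One record per changed page; the mimetype here is an assumption.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + body


# Hypothetical usage: two intranet pages that were just updated.
feed = build_meta_url_feed(
    "intranet_updates",
    ["http://intranet.example.com/a.html", "http://intranet.example.com/b.html"],
)
```

As far as I recall, the resulting XML is then sent to the appliance's feed port as a multipart form POST along with the feed type and datasource name. That step is left out here because the exact endpoint and form fields should come from the Feeds Protocol documentation.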

    Alternative: ensure the content source responds to HEAD requests with a Last-Modified date, and do not configure "crawl infrequently". The GSA will detect which pages have changed and slow the crawl down over time.
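To check whether a content source behaves this way, something like the following sketch could be used. It sends a HEAD request with Python's standard library and reports the `Last-Modified` header if one is present. The function names are illustrative, and any URL you pass in is your own.

```python
import urllib.request
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if missing or invalid.

    Note: a plain dict is case-sensitive, while the headers object returned
    by urllib matches header names case-insensitively.
    """
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable date string: treat it as if the header were absent.
        return None


def head_last_modified(url: str, timeout: float = 10.0) -> Optional[datetime]:
    """Send a HEAD request to url and return its Last-Modified date, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

If `head_last_modified` returns `None` for typical pages, the server is not giving the crawler a modification date to compare against, so the feed-based approach is probably the safer choice.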