I want to prevent data harvesting on my site (except by Googlebot, of course). I'm guessing that relying on Googlebot's User-Agent isn't strong enough, since any bot can fake it.
How can I authenticate Googlebot and reject the fakes?
The official way is to use a combination of forward and reverse DNS lookups; those can't be faked!
There's more information on Google's Webmaster blog: How to verify Googlebot
Telling webmasters to verify via DNS on a case-by-case basis seems like the best way to go. The recommended technique is to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS lookup on that googlebot.com name and confirm it resolves back to the original IP; e.g.:
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up their reverse DNS to point to crawl-a-b-c-d.googlebot.com.
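To make this concrete, here's a minimal sketch in Python of the full two-step check using only the standard library's resolver. The function name is made up, and the accepted domain suffixes (googlebot.com per the above, plus google.com, which Google's documentation also lists) are assumptions worth confirming against the current docs:

```python
import socket

def is_googlebot(ip):
    """Verify a claimed Googlebot IP with reverse + forward DNS."""
    try:
        # Reverse lookup: IP -> host name (the PTR record)
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    # A spoofer controls their own PTR record, so the name alone
    # proves nothing; it just tells us which forward lookup to do.
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    try:
        # Forward lookup: host name -> IPs, which the spoofer does
        # NOT control, since googlebot.com's DNS belongs to Google.
        _, _, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return ip in addresses
```

With the example above, is_googlebot('66.249.66.1') should come back True, while an impostor with a forged PTR record fails the forward check.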
However, I recommend caching the result of this lookup per IP and only re-checking periodically, so that validation doesn't add two DNS round-trips to every request.
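As a sketch of that caching, here's a simple per-IP memo with a time-to-live; the one-day TTL and all the names are arbitrary choices, and it builds on the hypothetical is_googlebot() from the previous sketch:

```python
import time

_verdicts = {}       # ip -> (is_google, checked_at); in-memory cache
CACHE_TTL = 86400    # re-verify each IP once a day; an arbitrary choice

def is_googlebot_cached(ip):
    """Cached wrapper around is_googlebot() so repeated requests
    from the same IP don't trigger fresh DNS lookups."""
    entry = _verdicts.get(ip)
    if entry and time.time() - entry[1] < CACHE_TTL:
        return entry[0]
    verdict = is_googlebot(ip)  # the sketch defined earlier
    _verdicts[ip] = (verdict, time.time())
    return verdict
```

In a multi-process setup you'd want something shared (memcached, Redis, or similar) rather than a per-process dict, but the idea is the same.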