
How to validate GoogleBot


I want to prevent data harvesting on my site (except by Googlebot, of course). I assume that relying on Googlebot's User-Agent string is not strong enough, since any bot can fake it.

How can I reliably authenticate Googlebot and reject fakes?


Solution

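  • The official way is to combine a reverse DNS lookup with a forward DNS lookup. A bot can fake the User-Agent, and even the reverse DNS record for its own IP, but it cannot control forward DNS for googlebot.com.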

    More information is in the post "How to verify Googlebot" on Google's Webmaster blog:

    Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; e.g.:

    > host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
    
    > host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1
    

    I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

    I recommend caching the result of this lookup per IP and refreshing it only periodically, so that validation does not add two DNS lookups to every request.