Google's Webmaster Guidelines state:
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
My ASP.NET 1.1 site uses custom authentication/authorization and relies pretty heavily on session GUIDs (similar to this approach). I'm worried that allowing non-session-tracked traffic will either break my existing code or introduce security vulnerabilities.
What best practices are there for allowing non-session-tracked bots to crawl a normally session-tracked site? And are there any ways of detecting search bots other than inspecting the user agent? (I don't want people to spoof themselves as Googlebot to get around my session tracking.)
The correct way to detect bots is by reverse DNS lookup on the requesting IP address (Dns.GetHostEntry). A few poorly behaved robots force you to track by IP address, but the popular ones generally don't; Googlebot requests come from hosts under *.googlebot.com. After you get the host entry, do a forward lookup on that host name and check IPHostEntry.AddressList to make sure it contains the original IP address.
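A minimal sketch of that forward-confirmed reverse DNS check in C#, assuming .NET 2.0 or later (where Dns.GetHostEntry is available; on .NET 1.1 you'd fall back to Dns.GetHostByAddress/Dns.Resolve). The BotVerifier class and IsVerifiedGooglebot method are just illustrative names:

```csharp
using System;
using System.Net;

public static class BotVerifier
{
    // Verifies that a request claiming to be Googlebot really comes from
    // Google, using forward-confirmed reverse DNS.
    public static bool IsVerifiedGooglebot(string ipString)
    {
        IPAddress ip = IPAddress.Parse(ipString);

        // 1. Reverse lookup: resolve the requesting IP to a host name.
        string host = Dns.GetHostEntry(ip).HostName;

        // 2. The host name must belong to Google's crawler domains.
        if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
            !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // 3. Forward lookup on that host name; the original IP must appear
        //    in IPHostEntry.AddressList, otherwise the PTR record was spoofed.
        foreach (IPAddress addr in Dns.GetHostEntry(host).AddressList)
        {
            if (addr.Equals(ip))
            {
                return true;
            }
        }
        return false;
    }
}
```

In practice you'd cache the result per IP (DNS lookups are slow) and wrap the lookups in a try/catch, since Dns.GetHostEntry throws a SocketException when no PTR record exists.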
Do not even look at the user agent when verifying robots.
See also http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html