How can one detect a crawler / spider using PHP?
I'm currently working on a project where I need to keep track of each crawler's visit.
I know that I should use HTTP_USER_AGENT, but I'm not sure how to structure the code for this. I also know that the user agent can be changed very easily, so I would like to know whether additional checks can be added to guard against spoofing.
Sample code of what I'm trying to do:
<?php
$user_agent = $_SERVER['HTTP_USER_AGENT'];
if (strpos($user_agent, 'Google') !== false)
{
    echo "Googlebot is here";
}
?>
Thank you
According to Verifying Googlebot:
You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
For example:
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).
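You can do a reverse DNS lookup and then confirm the result with a forward lookup:
function validateGoogleBotIP($ip) {
    // Reverse lookup, e.g. "crawl-66-249-66-1.googlebot.com".
    // gethostbyaddr() returns the IP unchanged when there is no PTR record.
    $hostname = gethostbyaddr($ip);
    if (!$hostname || $hostname === $ip || !preg_match('/\.google(bot)?\.com$/i', $hostname)) {
        return false;
    }
    // Forward lookup (IPv4 only): the name must resolve back to the same IP.
    return in_array($ip, gethostbynamel($hostname) ?: array(), true);
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}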
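Since the question is about tracking every crawler's visit rather than Googlebot alone, the same DNS check can be extended to a small table of crawlers. The following is only a sketch under stated assumptions: the function name identifyCrawler, the CRAWLER_HOSTS table, and the two optional resolver parameters are hypothetical and not part of any library. The Google suffixes come from the check above; the Bing suffix is an assumption that should be confirmed against that crawler's own documentation. The resolvers default to PHP's gethostbyaddr() and gethostbynamel(), and can be swapped out so the logic can be exercised without real DNS.

```php
<?php
// Hypothetical table: user-agent token => hostname suffixes accepted for it.
// The Bing entry is an assumption; verify it before relying on it.
const CRAWLER_HOSTS = [
    'Googlebot' => ['.googlebot.com', '.google.com'],
    'bingbot'   => ['.search.msn.com'],
];

/**
 * Returns the matched crawler token if the user agent claims to be a known
 * crawler AND the IP passes reverse + forward DNS checks, otherwise null.
 *
 * $reverse: callable(string $ip): string|false   (default gethostbyaddr)
 * $forward: callable(string $host): array|false  (default gethostbynamel, IPv4 only)
 */
function identifyCrawler(string $userAgent, string $ip, ?callable $reverse = null, ?callable $forward = null): ?string
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    foreach (CRAWLER_HOSTS as $token => $suffixes) {
        if (stripos($userAgent, $token) === false) {
            continue; // user agent does not claim to be this crawler
        }
        $host = $reverse($ip);
        if (!is_string($host) || $host === $ip) {
            return null; // no usable PTR record for this IP
        }
        $host = strtolower(rtrim($host, '.'));
        foreach ($suffixes as $suffix) {
            $suffixMatches = substr($host, -strlen($suffix)) === $suffix;
            // The hostname must also resolve back to the connecting IP.
            if ($suffixMatches && in_array($ip, $forward($host) ?: [], true)) {
                return $token;
            }
        }
        return null; // claimed to be a crawler, but DNS did not confirm it
    }
    return null; // not a known crawler
}
```

In a request handler this might be called as identifyCrawler($_SERVER['HTTP_USER_AGENT'] ?? '', $_SERVER['REMOTE_ADDR']), with any non-null result written to whatever visit log the project uses. DNS lookups are slow, so caching the verdict per IP is probably worthwhile on a busy site.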