I recently changed my robots.txt to disallow bots from making expensive search API queries. Bots are still allowed on all other pages; only /q?... is disallowed, since that is a search API query and expensive to serve.
User-agent: *
Disallow: /q?
Sitemap: /sitemap.xml.gz
Now I'm still getting bots in my logs. Is it really Google, or just something claiming to be "Googlebot compatible"? How can I keep bots away from /q? completely?
2014-10-18 21:04:23.474 /q?query=category%3D5030%20and%20cityID%3D4698187&o=4 200 261ms 7kb Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) module=default version=disallow
66.249.79.28 - - [18/Oct/2014:12:04:23 -0700] "GET /q?query=category%3D5030%20and%20cityID%3D4698187&o=4 HTTP/1.1" 200 8005 - "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ms=261 cpu_ms=108 cpm_usd=0.050895 app_engine_release=1.9.13 instance=00c61b117cdfd20321977d865dd08cef54e2fa
Can I blacklist specific bots based on their HTTP headers in my request handler or in my dos.yaml, if robots.txt can't do it? When I run this log search, I get 50 matches in the last 2 hours:
path:/q.* useragent:.*Googlebot.*
The log rows look like this and appear to come from Googlebot:
2014-10-19 00:37:34.449 /q?query=category%3D1030%20and%20cityID%3D4752198&o=18 200 138ms 7kb Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) module=default version=disallow
66.249.79.102 - - [18/Oct/2014:15:37:34 -0700] "GET /q?query=category%3D1030%20and%20cityID%3D4752198&o=18 HTTP/1.1" 200 7965 - "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "www.classifiedsmarket.appspot.com" ms=138 cpu_ms=64 cpm_usd=0.050890 app_engine_release=1.9.13 instance=00c61b117c781458f46764c359368330c7d7fdc4
Yes, every visitor/bot can claim to be Googlebot/2.1 (by changing the User-Agent header).
You can verify that it was the real Googlebot with a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm the host name resolves back to that IP.
According to the IPs from your logs, it seems to have been the real bot. And your robots.txt is correct, too. So it should only be a matter of time until Google picks up the new rule, after which these requests to /q? should stop.
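
If you want to run that verification yourself, here is a minimal sketch in Python, meant to be run locally (the function name is illustrative; the IP is taken from your logs):

import socket

def is_real_googlebot(ip):
    # Reverse DNS: a genuine Googlebot IP resolves to a host name
    # under googlebot.com or google.com.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward DNS: that host name must resolve back to the same IP,
    # otherwise the reverse record could be spoofed.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(is_real_googlebot('66.249.79.28'))  # IP from the log lines above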
Bots that don’t honor your robots.txt can of course be blocked from accessing the resources, but (depending on your criteria for identifying bots) this bears the risk of blocking human visitors, too.
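
If you decide to block on the server anyway, the User-Agent check has to happen in your request handler; dos.yaml only blacklists IP addresses and subnets, not headers. A minimal sketch, assuming a Python webapp2 app on App Engine (the handler name and the agent list are illustrative, not your actual code):

import webapp2

# Substrings of User-Agent headers to reject on the expensive endpoint.
# Illustrative list; adjust it to what you actually see in your logs.
BLOCKED_AGENTS = ('Googlebot', 'bingbot', 'Baiduspider')

class SearchHandler(webapp2.RequestHandler):
    def get(self):
        ua = self.request.headers.get('User-Agent', '')
        if any(bot in ua for bot in BLOCKED_AGENTS):
            # A 403 is far cheaper than running the search query.
            self.abort(403)
        # ... run the expensive /q search here ...
        self.response.write('results')

app = webapp2.WSGIApplication([('/q', SearchHandler)])

Matching on the User-Agent only catches bots that announce themselves; anything more aggressive (blocking IP ranges, rate limiting) is where the risk of hitting human visitors mentioned above comes in.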