For some reason when I check on Google Webmaster Tool's "Analyze robots.txt" to see which urls are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Anything in the scripts folder are correctly blocked for both the Googlebot and Mediapartners-Google. I can see that the two robots are seeing the correct directive because the Googlebot says the scripts are blocked from line 7 while the Mediapartners-Google is blocked from line 4. And yet ANY other url I put in from the disallowed urls under the second user-agent directive are NOT blocked!
I'm wondering if my comment or using absolute urls are screwing things up...
Any insight is appreciated. Thanks.
The reason why they are ignored is that you have the fully qualified URL in the robots.txt
file for Disallow
entries while the specification doesn't allow it. (You should only specify relative paths, or absolute paths using /). Try the following:
Sitemap: /sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
As for caching, google tries to get a copy of the robots.txt file every 24 hours in average.