
Googlebot not respecting Robots.txt


For some reason, when I use Google Webmaster Tools' "Analyze robots.txt" feature to check which URLs are blocked by our robots.txt file, the results are not what I expect. Here is a snippet from the beginning of our file:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that the two robots are reading the correct directives, because Googlebot reports the scripts as blocked by line 7 while Mediapartners-Google reports line 4. And yet ANY other URL I test from the disallowed URLs under the second user-agent directive is NOT blocked!

I'm wondering if my comment or my use of absolute URLs is screwing things up...

Any insight is appreciated. Thanks.


Solution

  • The reason they are ignored is that your Disallow entries use fully qualified URLs, which the robots.txt specification doesn't allow: a Disallow value must be a path relative to the site root, beginning with /. (The Sitemap directive, by contrast, does take a full URL, so that line can stay as it is.) Try the following:

    Sitemap: http://[omitted]/sitemap_index.xml
    
    User-agent: Mediapartners-Google
    Disallow: /scripts
    
    User-agent: *
    Disallow: /scripts
    # list of articles given by the Content group
    Disallow: /Living/books/book-review-not-stupid.aspx
    Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
    Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
    

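    To double-check the corrected rules before deploying them, you can run them through Python's built-in urllib.robotparser. This is just a local sanity check, and example.com below is a stand-in for the omitted domain:

        from urllib import robotparser

        # The corrected rules, with root-relative Disallow paths
        rules = [
            "User-agent: *",
            "Disallow: /scripts",
            "Disallow: /Living/books/book-review-not-stupid.aspx",
        ]

        rp = robotparser.RobotFileParser()
        rp.parse(rules)

        # Both should now print False, i.e. the URLs are blocked:
        print(rp.can_fetch("Googlebot", "http://example.com/scripts/main.js"))
        print(rp.can_fetch("Googlebot", "http://example.com/Living/books/book-review-not-stupid.aspx"))
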
    As for caching, Google tries to fetch a fresh copy of the robots.txt file every 24 hours on average.
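
    If you want your own code to mirror that refresh cadence, urllib.robotparser also tracks when the file was last fetched. Here's a minimal sketch; can_fetch_fresh is a hypothetical helper, and example.com again stands in for the real domain:

        import time
        from urllib import robotparser

        rp = robotparser.RobotFileParser("http://example.com/robots.txt")

        def can_fetch_fresh(user_agent, url, max_age=24 * 60 * 60):
            # Re-download robots.txt if our copy is older than ~24 hours
            # (mtime() is 0 before the first fetch, which forces a read).
            if time.time() - rp.mtime() > max_age:
                rp.read()      # fetch and parse robots.txt
                rp.modified()  # record the fetch time
            return rp.can_fetch(user_agent, url)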