ruby-on-rails, seo, robots.txt

Rails app and robots.txt best practice


I was wondering what the standard practice is for a Rails app's robots.txt file, i.e. which folders are generally blocked from being crawled by robots.

My current robots.txt file is:

# User-agent: *
# Disallow: /
Disallow: /public/uploads/
Sitemap: www.mysite.co.za/sitemap.xml


My question is, do most people disallow /public/uploads from being crawled?

I also have a number of models/pages that only the admin user can access and perform CRUD operations on. They are protected by Devise. I was wondering whether these need to be disallowed in the robots.txt file, and whether a spider is even able to index these pages (since they are restricted to admin use).

For example, I have a Category model on which only the admin can perform CRUD operations. Should I add:

Disallow: /categories/

(or is it with the *)

Disallow: /categories/*


These are all my questions about robots.txt usage in Rails. Does this make sense?
Thanks,
Matt


Solution

  • Your robots.txt isn’t correct, as you have no User-agent line (at least one is required per block). (# starts comments, so the first two lines are comments.)

    Only you can decide if you want to disallow the crawling of URLs whose paths start with /public/uploads/. Are there resources you might want bots to access/crawl? If yes, don’t block it.

    Appending an * would block only URL paths that literally start with /public/uploads/* (some bots give the * additional meaning, but that is not part of the original robots.txt specification). So you should not append an *.

    If your protection of the admin pages works, bots can't visit the actual admin pages. They'll probably see an error page (depending on your implementation). If you send the correct status code (e.g., 403 or 404), you don't have to block these pages in your robots.txt. But blocking them won't hurt either (and can save you in situations where you really mess something up).

    Also, the value of Sitemap should be the full URL (you are omitting the protocol).
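
Putting these points together, here is a minimal Ruby sketch of what a corrected file might look like, and of how literal prefix matching (as in the original robots.txt specification) treats a trailing *. The https:// scheme in the Sitemap line is an assumption, and the helpers disallow_rules and blocked? are hypothetical names made up for illustration, not part of Rails or any library. Real crawlers may apply extra rules (such as wildcard support), so treat this as a sketch rather than a reference implementation.

```ruby
# Corrected file: an uncommented User-agent line opens the block, and the
# Sitemap value is a full URL. The "https://" scheme is an assumption; use
# whichever scheme the site actually serves.
# Note: Rails normally serves files under public/ from the site root, so the
# real URL path may be /uploads/ rather than /public/uploads/. The path from
# the question is kept here as-is.
ROBOTS_TXT = <<~ROBOTS
  User-agent: *
  Disallow: /public/uploads/

  Sitemap: https://www.mysite.co.za/sitemap.xml
ROBOTS

# Hypothetical helper: collect the non-empty Disallow values from the text.
def disallow_rules(robots_txt)
  robots_txt.each_line.filter_map do |line|
    line = line.sub(/#.*/, "").strip # everything after # is a comment
    key, value = line.split(":", 2)
    value.strip if key&.casecmp?("disallow") && value && !value.strip.empty?
  end
end

# Original-spec style matching: a path is blocked when it starts with the
# rule, character for character. A * in the rule is just a literal character.
def blocked?(path, rules)
  rules.any? { |rule| path.start_with?(rule) }
end

rules = disallow_rules(ROBOTS_TXT)
puts blocked?("/public/uploads/avatar.png", rules)                 # true
puts blocked?("/posts/1", rules)                                   # false
puts blocked?("/public/uploads/avatar.png", ["/public/uploads/*"]) # false
```

The last call shows why the * should be left off: under literal matching, no ordinary upload path starts with the characters /public/uploads/*, so that rule would block nothing useful.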