Recently I saw a site's robots.txt as follows:
User-agent: *
Allow: /login
Allow: /register
I could find only Allow entries and no Disallow entries.
From this, I had understood that robots.txt is essentially a blacklist file: you use Disallow to block pages from being crawled, and Allow is only used to re-permit a sub-section of a domain that has already been blocked with Disallow, similar to this:
Allow: /crawlthis
Disallow: /
But that robots.txt has no Disallow entries at all. So does it let Google crawl all the pages, or does it allow only the pages explicitly listed with Allow?
You are right that this robots.txt file allows Google to crawl all the pages on the website. A thorough guide can be found here: http://www.robotstxt.org/robotstxt.html.
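If you want to check that behaviour yourself, Python's standard-library urllib.robotparser can evaluate a robots.txt against sample paths. This is just a quick sketch, and the extra paths are made up for illustration:

import urllib.robotparser

# The Allow-only robots.txt from the question.
robots_txt = """\
User-agent: *
Allow: /login
Allow: /register
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# With no Disallow rule there is nothing to block, so every path is crawlable.
for path in ("/login", "/register", "/some-article", "/"):
    print(path, parser.can_fetch("Googlebot", path))  # prints True for every path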
If you want Googlebot to be allowed to crawl only the specified pages, then the correct format would be:
User-agent: *
Disallow: /
Allow: /login
Allow: /register
(I would normally disallow those specific pages, though, as they don't provide much value to searchers.)
It's important to note that the Allow directive only works with some robots (including Googlebot).
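Parser support is also worth testing for the whitelist version. As a rough sketch (with made-up sample paths), the same urllib.robotparser approach works, but note that Python's parser applies the first matching rule, whereas Googlebot picks the most specific (longest) matching rule regardless of order, so the Allow lines are placed before the catch-all Disallow here to mirror Googlebot's interpretation:

import urllib.robotparser

# Whitelist-style robots.txt: Allow lines first so urllib.robotparser's
# first-match evaluation matches what Googlebot would do.
robots_txt = """\
User-agent: *
Allow: /login
Allow: /register
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Googlebot", "/login"))         # True
print(parser.can_fetch("Googlebot", "/register"))      # True
print(parser.can_fetch("Googlebot", "/some-article"))  # False: caught by Disallow: /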