Let's assume we are using pretty URLs with mod_rewrite
or something similar and have the following two routes:
/page
/page-two
Now we want to disallow only the first route (/page) from being crawled by robots.
# robots.txt
User-agent: *
Disallow: /page
From the original robots.txt specification on the Disallow field (http://www.robotstxt.org/orig.html):
... For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
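In other words, a Disallow path is matched as a simple prefix of the URL path. A minimal Python sketch of that prefix rule (my own illustration, not a real robots.txt parser):
# Prefix matching as described in the quote above
for rule in ("/help", "/help/"):
    for path in ("/help.html", "/help/index.html"):
        blocked = path.startswith(rule)
        print(f"Disallow: {rule} vs {path} -> {'blocked' if blocked else 'allowed'}")
# "/help" blocks both paths; "/help/" blocks only "/help/index.html"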
So the above robots.txt example is disallowing /page-two too, correct?
What is the correct way to get this done?
Would the following work?
# robots.txt
User-agent: *
Disallow: /page/
From Google's robots.txt specifications:
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.
This means that it doesn't matter in what order you define them. In your case this should work:
User-agent: *
Disallow: /page
Allow: /page-
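To illustrate the precedence rule, here is a minimal Python sketch of a longest-match evaluator (my own simplification: it ignores wildcards and the $ anchor, and the function name is_allowed is just for this example):
# Simplified longest-match evaluation of Allow/Disallow rules
RULES = [
    ("Disallow", "/page"),
    ("Allow", "/page-"),
]

def is_allowed(url_path, rules=RULES):
    """Return True if crawling url_path is allowed under the rules."""
    # Collect every rule whose path is a prefix of the URL path
    matches = [(directive, path) for directive, path in rules
               if url_path.startswith(path)]
    if not matches:
        return True  # no rule matches -> allowed by default
    # The rule with the longest path is the most specific and wins
    directive, _ = max(matches, key=lambda m: len(m[1]))
    return directive == "Allow"
With these two rules, is_allowed("/page-two") comes out True while is_allowed("/page/123") comes out False.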
To make it clearer: every URL is matched against all paths. /page will match /page/123, /page/subdirectory/123/whateverishere.html, /page-123 and /page. The directive with the longest matching path is used. If both /page and /page- match, the directive for /page- is used (Allow). If /page matches but /page- doesn't, the directive for /page is used (Disallow). If neither /page nor /page- matches, the default is assumed (Allow).
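That walkthrough can be checked mechanically. The snippet below applies the same longest-match logic to the example URLs (again a simplified Python sketch that ignores wildcards):
# Checking the walkthrough with the same longest-match logic
rules = {"/page": "Disallow", "/page-": "Allow"}
urls = ["/page", "/page/123", "/page/subdirectory/123/whateverishere.html",
        "/page-123", "/page-two", "/other"]

for url in urls:
    matching = [path for path in rules if url.startswith(path)]
    verdict = rules[max(matching, key=len)] if matching else "Allow"
    print(f"{url}: {verdict}")
# /page and its subpaths come out as Disallow, /page-123 and /page-two
# as Allow, and /other falls through to the default (Allow).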