I only serve images from my CDN.
I have a robots.txt file set up on my CDN domain, separate from the one on my 'normal' www domain.
How should I format the robots.txt file on the CDN domain so that it blocks the indexing of everything except images (regardless of their location)?
The reason for all this is that I want to avoid duplicate content.
Is this correct?
User-agent: *
Disallow: /
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$
If you have all images in certain folders, you could use:
For Googlebot-Image only:
User-agent: Googlebot-Image
Allow: /some-images-folder/
For all user-agents:
User-agent: *
Allow: /some-images-folder/
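To see how the folder rule interacts with a site-wide Disallow, here is a minimal sketch using Python's urllib.robotparser (which implements the original first-match standard, not Google's wildcard extensions); the cdn.example.com host and file names are made up:

from urllib.robotparser import RobotFileParser

# Hypothetical CDN robots.txt: allow one image folder, block everything else.
rules = [
    "User-agent: *",
    "Allow: /some-images-folder/",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Inside the allowed folder -> True; anything else -> False.
print(rp.can_fetch("*", "https://cdn.example.com/some-images-folder/photo.jpg"))
print(rp.can_fetch("*", "https://cdn.example.com/index.html"))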
Additionally, Google has added flexibility to the robots.txt standard through the use of wildcards: patterns may include "*" to match any sequence of characters, and may end in "$" to mark the end of the URL.
To allow a specific file type (for example, .gif images), you can use the following robots.txt entry:
User-agent: Googlebot-Image
Allow: /*.gif$
Info 1: By default (i.e. if you don't have a robots.txt), all content is crawled.
Info 2: The Allow statement should come before the Disallow statement, no matter how specific your statements are, because crawlers that follow the original standard use the first matching rule.
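As a quick illustration of Info 2, here is the folder example from above with the two rule orders swapped, checked with Python's urllib.robotparser, which follows the original first-match rule (Googlebot itself uses the most specific match, so order matters less there):

from urllib.robotparser import RobotFileParser

orders = {
    "Allow first": ["User-agent: *", "Allow: /some-images-folder/", "Disallow: /"],
    "Disallow first": ["User-agent: *", "Disallow: /", "Allow: /some-images-folder/"],
}

for name, lines in orders.items():
    rp = RobotFileParser()
    rp.parse(lines)
    # A first-match parser stops at "Disallow: /" if it comes first.
    print(name, "->", rp.can_fetch("*", "/some-images-folder/photo.jpg"))

This prints True for "Allow first" and False for "Disallow first".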
Here's a wiki link to the robots exclusion standard for a more detailed description.
According to that, your example should look like:
User-agent: *
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.png$
Disallow: /
NOTE: As nev pointed out in his comment, it's also important to watch out for query strings at the end of extensions, like image.jpg?x12345, so also include:
Allow: /*.jpg?*$
(and the equivalent entries for the other extensions)
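To see why the extra query-string rule is needed, here is a rough sketch of the wildcard matching described above ("*" matches any run of characters, "$" anchors the end of the URL); it is a simplification for illustration, not Google's actual matcher, and the URLs are made up:

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

for rule in ["/*.jpg$", "/*.jpg?*$"]:
    rx = robots_pattern_to_regex(rule)
    for url in ["/images/photo.jpg", "/images/photo.jpg?x12345"]:
        print(rule, "matches", url, "->", bool(rx.match(url)))

Only /*.jpg$ matches the bare image URL, and only /*.jpg?*$ matches the one with a query string, which is why both Allow lines are needed.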