Tags: amazon-s3, amazon-cloudfront, google-search-console, x-robots-tag

How to stop Googlebot from indexing a folder inside my S3 bucket?


I have an Amazon S3 bucket with static website hosting set up, plus CloudFront. There is a folder inside the S3 bucket [ example.com/Books ] which contains PDF files. I've submitted a sitemap in Google Search Console [ which doesn't contain any PDF URLs ], but Google is indexing the PDF files in the search results.

In Search Console I've submitted a removal request so that all URLs with the prefix [ example.com/Books/* ] are removed from the search results immediately. I've searched for how to stop files and folders from being indexed and found that I have to add "X-Robots-Tag: noindex" as an HTTP response header. How do I add that to an S3 bucket? I've added custom metadata to the objects in the 'Books' folder: x-amz-meta-X-Robots-Tag: noindex.
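(For reference, a minimal sketch of setting that metadata on a single object with boto3; the bucket and key names are placeholders, not the asker's actual objects. S3 stores user-defined metadata under the x-amz-meta- prefix, so `Metadata={'X-Robots-Tag': 'noindex'}` is served back as `x-amz-meta-x-robots-tag`.)

```python
import boto3

s3 = boto3.client("s3")

# Copy the object onto itself to rewrite its metadata in place.
# Bucket and key are placeholders for illustration only.
s3.copy_object(
    Bucket="example.com",
    Key="Books/sample.pdf",
    CopySource={"Bucket": "example.com", "Key": "Books/sample.pdf"},
    Metadata={"X-Robots-Tag": "noindex"},
    MetadataDirective="REPLACE",    # required to overwrite metadata on a copy-in-place
    ContentType="application/pdf",  # restate Content-Type, since REPLACE resets it
)
```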

I've read numerous posts saying that I shouldn't block the bots from accessing that folder with robots.txt, as the search engines would then never see the "noindex" HTTP header I've added to those objects. What should I do now?


Solution

  • I had to use a Lambda@Edge function to edit the origin response headers when those files are accessed via the CloudFront URL [ the custom domain you've connected to your CloudFront distribution ]. In the response HTTP headers, we have to strip the x-amz-meta- prefix from the key name of the user-defined header, so the crawlers will see X-Robots-Tag: noindex as a plain HTTP header while accessing those files and follow its protocols. More information is available here. A sketch of such a function follows below.
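A minimal sketch of what such a Lambda@Edge function could look like, attached to the distribution's origin-response event (Python runtime assumed; the metadata key name matches the one added in the question, and is an assumption about how the objects were tagged):

```python
def lambda_handler(event, context):
    # Lambda@Edge origin-response event: the S3 origin's response is in
    # event['Records'][0]['cf']['response'], with header names lowercased.
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # S3 exposes user-defined metadata with the "x-amz-meta-" prefix;
    # crawlers only honor the bare "X-Robots-Tag" header.
    meta_key = "x-amz-meta-x-robots-tag"
    if meta_key in headers:
        value = headers[meta_key][0]["value"]
        headers["x-robots-tag"] = [{"key": "X-Robots-Tag", "value": value}]
        del headers[meta_key]

    return response
```

With this in place, a crawler fetching example.com/Books/sample.pdf through CloudFront should receive `X-Robots-Tag: noindex` and drop the file from its index on the next recrawl.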