
Preventing indexing of PDF files with htaccess


I have a ton of PDF files in different folders on my website. I need to prevent them from being indexed by Google using .htaccess (since robots.txt apparently doesn't prevent indexing if other pages link to the files).

However, I've tried adding the following to my .htaccess file:

<Files ~ "\.pdf$">
Header append X-Robots-Tag "noindex, nofollow, noarchive, nosnippet"
</Files>

to no avail; the PDF files still show up when googling "site:mysite.com pdf", even after I've asked Google to re-index the site.

I don't have the option of hosting the files elsewhere or protecting them with a login system; I'd really like to simply get the htaccess file to do the job. What am I missing?


Solution

  • From your comment on another answer, I understand that you want to remove files/folders that Google has already indexed. As a temporary measure, you can forbid direct access to them with the rules below.

    First, here is the workaround; after that I'll explain the longer-term fix.

    Note that <Files>/<FilesMatch> sections in .htaccess match file names only, never paths, so either place these rules in an .htaccess file inside the folder you want to protect, or match by extension as shown here:

    <Files "path/to/pdf/* ">  
    
        Order Allow,Deny
        Deny from all
        Require all denied
    </Files>
    

    With this in place, every matching file is forbidden over HTTP: a direct request returns 403 Forbidden. The server can still use the files programmatically (sending them as email attachments, deleting them, and so on), but visitors will not be able to view them.

    You can then write a server-side script that reads a file internally and outputs its contents, rather than exposing a direct URL (assuming the data is sensitive, as it seems to be). The folder path and file parameter below are placeholders:

    Example

    <?php
    // Resolve the request inside the protected folder;
    // basename() blocks path-traversal input such as "../../etc/passwd"
    $filePath = '/path/to/pdfs/' . basename($_GET['file']);

    header('Content-Type: ' . mime_content_type($filePath));
    header('Content-Length: ' . filesize($filePath));
    readfile($filePath); // stream the file straight to the client
    
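    Direct requests can then be routed to that script with mod_rewrite. A minimal sketch, assuming the script is saved as serve-pdf.php in the site root and the PDFs are requested under pdfs/ (both names are hypothetical):

    # Site-root .htaccess (requires mod_rewrite)
    RewriteEngine On
    # Hand any request for a PDF under pdfs/ to the serving script
    RewriteRule ^pdfs/(.+\.pdf)$ serve-pdf.php?file=$1 [L,QSA]

    The script can validate the requested name before touching the disk, while the deny rules above still block any PDF fetched directly.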

    Indexing vs. forbidding (for background)

    Preventing indexing only stops search-engine bots from indexing the folder/files; anyone who visits a file's URL directly can still view it.

    In the case of forbidding, no external user or bot will be able to see or access the file/folder at all.

    If you have only recently forbidden access to your PDF folder, the files may stay visible in Google until Googlebot recrawls your site and finds them gone, or until you serve a noindex directive for that specific folder.
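    For the noindex route, one common pattern is to drop an .htaccess file inside the PDF folder itself so the header applies to everything in it. A minimal sketch, assuming mod_headers is enabled on the server:

    # .htaccess inside the PDF folder (requires mod_headers)
    <IfModule mod_headers.c>
        Header set X-Robots-Tag "noindex, nofollow"
    </IfModule>

    Unlike the deny rules, this leaves the files viewable by visitors; Google only drops them from results after recrawling them and seeing the header.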

    You can read more about Google's crawl rate at https://support.google.com/webmasters/answer/48620?hl=en

    If you still want the results removed sooner, you can request removal in Google Search Console: https://www.google.com/webmasters/tools/googlebot-report?pli=1