apache · .htaccess · mod-rewrite · search-engine-bots

How to block content hotlinking, except for Google indexing, with .htaccess rules


I have prepared a .htaccess file and placed it in a directory of PDF files to prevent hotlinking from anywhere except my own site, as follows:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example\.com [NC]
RewriteRule ([^/]+)\.(pdf)$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]

This rule works as expected: if the link comes from an external site, the request is redirected to my search page, where the platform searches for that file (and similar ones).

So, when I search on Google, the results it shows (which were indexed earlier) redirect to my search page, and that's fine. However, I'm concerned about the next time Google indexes my site, so I added a new condition as follows:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example\.com [NC]
RewriteCond %{HTTP_USER_AGENT} !(googlebot) [NC]
RewriteRule ([^/]+)\.(pdf)$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]

However, I'm not sure whether that rule works, or how to check it. If I try to access a file from Google's search results, I'm still redirected to my search page, so it doesn't affect existing search results.
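One way to check a rule like this by hand is with curl, spoofing the Referer and User-Agent headers. This is a sketch; the file path is a placeholder for a real PDF on your site, and the Googlebot User-Agent string is the one Google publishes for its crawler:

```shell
# External referer, normal browser: the rule should fire,
# so expect a 302 redirect to the search page.
curl -I -e "http://othersite.example/page" \
     "http://www.example.com/files/test.pdf"

# External referer, but Googlebot User-Agent: the second
# RewriteCond should skip the rule, so expect a 200.
curl -I -e "http://othersite.example/page" \
     -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     "http://www.example.com/files/test.pdf"
```

The `-I` flag makes curl issue a HEAD request and print only the response headers, so you can read off the status code and any `Location:` header without downloading the file.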

Will this rule allow Google to index my new PDF files while still preventing direct access from the Google search results page? If not, what is the correct way to achieve this?


Solution

  • While your .htaccess rules will block hotlinking, they will not work well with search indexers and other robots: the search engines will still be able to index your files.

    To prevent search engines from indexing your files, you need to send an X-Robots-Tag response header. Google provides documentation on how to prevent robots from indexing, caching, or archiving a page it has crawled.

    <Files ~ "\.pdf$">
      # Setting response headers requires mod_headers to be enabled
      Header set X-Robots-Tag "noindex, nofollow"
    </Files>
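    Putting the two pieces together, the directory's .htaccess might look like the sketch below. It assumes mod_rewrite and mod_headers are enabled, and example.com stands in for your own domain:

    ```apache
    RewriteEngine On

    # Redirect hotlinked PDF requests that don't come from our own site
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com [NC]
    RewriteRule ([^/]+)\.pdf$ http://www.example.com/search_gcse/?q=$1 [NC,R,L]

    # Tell crawlers not to index the PDFs themselves
    <Files ~ "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </Files>
    ```

    Note that the header is only seen by a crawler that actually fetches the file, so it complements rather than replaces the rewrite rules.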