
Self referencing canonical tag in htaccess?


RewriteCond %{REQUEST_URI} !^/assets/pub/pdf-docs/.*$ [NC]

I need all the PDF files in the assets/pub/pdf-docs folder to be served with a self-referencing canonical HTTP header (using the https URL). How can I do this with one(ish) line(s) of code in the .htaccess file?

I cannot simply apply it to every .pdf file, because the PDFs in assets/pvt/pdf-docs are excluded from indexing.

Many thanks


Solution

  • You could do it like this:

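    RewriteEngine On

    # Set env var CANONICAL_URL when an existing .pdf file under
    # /assets/pub/pdf-docs/ is requested
    RewriteCond %{REQUEST_FILENAME} -f
    RewriteRule ^assets/pub/pdf-docs/.+\.pdf$ - [E=CANONICAL_URL:https://%{SERVER_NAME}%{REQUEST_URI}]

    # Send the canonical Link header only when CANONICAL_URL is set
    Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
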

    The mod_rewrite directives set an environment variable CANONICAL_URL if an existing .pdf file in the stated URL-path is requested. The Header directive then sets a rel="canonical" Link header, using this env var (i.e. %{CANONICAL_URL}e), but only if this env var is set.

    Getting the canonical hostname from SERVER_NAME depends on either the hostname already being canonicalised (i.e. www vs non-www etc.) before these directives are processed, OR UseCanonicalName On being set together with an appropriate ServerName in the server config (otherwise SERVER_NAME is simply the same as HTTP_HOST - the value of the HTTP Host header). If neither is the case, hardcode the canonical hostname in place of %{SERVER_NAME}.
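
    If you do end up hardcoding the hostname, the directives might look something like the sketch below. Note that www.example.com is only a placeholder (it does not come from the question) for whatever your canonical hostname actually is:

    ```apache
    RewriteEngine On

    # www.example.com is a placeholder; substitute your own canonical hostname
    RewriteCond %{REQUEST_FILENAME} -f
    RewriteRule ^assets/pub/pdf-docs/.+\.pdf$ - [E=CANONICAL_URL:https://www.example.com%{REQUEST_URI}]

    Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
    ```

    Only the hostname in the E flag changes. REQUEST_URI contains just the URL-path, so any query string on the request is left out of the canonical URL.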


    The issue is that Google is choosing old PDFs that return a 404 as canonical and ignoring the correct PDFs. The correct PDFs are in the sitemap.

    HOWEVER, as I stated in comments, setting this self-referential canonical tag on the new PDF is not going to help you - it's not going to prevent the old PDFs (that return a 404) from appearing in the search results.

    For that, you need to 301 (permanent) redirect the old PDFs to the new ones, to inform search engines that the old PDFs have moved to a new URL (and to redirect users who arrive from the search engine's results)*1. A separate sitemap containing only the "old" PDFs (that now redirect) can also help search engines crawl the old URLs and discover the redirects. Adding this sitemap to GSC (Google Search Console) will give you an idea of the index status of these old (and out-of-date) URLs.
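
    As a rough sketch, a redirect for a single moved PDF could be written with mod_rewrite along these lines. The file names old-report.pdf and new-report.pdf are hypothetical, since the actual old and new URLs are not given in the question, and the pattern assumes the .htaccess file is in the document root:

    ```apache
    RewriteEngine On

    # Hypothetical file names. You need one rule per moved PDF, unless the
    # old and new URLs follow a pattern that a single regex can capture.
    RewriteRule ^assets/pub/pdf-docs/old-report\.pdf$ /assets/pub/pdf-docs/new-report.pdf [R=301,L]
    ```

    Redirects like this generally belong near the top of the .htaccess file, before any internal rewrites. It is worth testing with a 302 (temporary) redirect first, since browsers cache 301s persistently.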

    *1 This assumes that the PDFs have simply changed URL and are not entirely new and unrelated documents. If they are new and unrelated, a redirect would not be appropriate; instead you should serve a "410 Gone" and request that these old URLs be removed from the SERPs using Google's URL removal tool to expedite the removal process.
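
    For the "410 Gone" case, one possible sketch uses the mod_rewrite G flag. Again, retired-report.pdf is a hypothetical file name standing in for one of your old URLs:

    ```apache
    RewriteEngine On

    # Hypothetical file name; the G flag makes Apache respond with "410 Gone"
    RewriteRule ^assets/pub/pdf-docs/retired-report\.pdf$ - [G]
    ```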

    (Adding a self-referential canonical tag is only going to help if the same PDF is accessible from different URLs - but this is irrelevant to your current issue. And setting a canonical for that case isn't something you could necessarily do in .htaccess, unless you do it one-by-one for each PDF, or there is a discernible pattern that allows you to generate the canonical URL regardless of the request.)
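
    Purely to illustrate the "discernible pattern" case: suppose, hypothetically, that every public PDF were also reachable under a second path, /downloads/, with the same file name. Nothing in the question says this is so; both the /downloads/ path and www.example.com are made up for the example. The canonical URL could then be generated from the captured file name, whichever of the two URLs was requested:

    ```apache
    RewriteEngine On

    # Hypothetical: /downloads/<name>.pdf is a duplicate of
    # /assets/pub/pdf-docs/<name>.pdf, and the latter is the canonical URL
    RewriteRule ^(?:downloads|assets/pub/pdf-docs)/([^/]+\.pdf)$ - [E=CANONICAL_URL:https://www.example.com/assets/pub/pdf-docs/$1]

    Header add Link '<%{CANONICAL_URL}e>; rel="canonical"' env=CANONICAL_URL
    ```

    Here $1 is the file name captured by the RewriteRule pattern, so both URLs report the same canonical. How the duplicate path is actually served is outside the scope of this sketch.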