htaccess: how to avoid repeated conditions/rules


A well-known issue: Google's indexer reports that it can see 2x2x2 = 8 duplicates of a URL, where the variants differ in:

  • http - https
  • www - no www
  • root/ - root/index.php

(8 duplicates for the root URL, and 4 duplicates for every other page URL)

I use the following working code in .htaccess to 301-redirect all of the duplicates:

RewriteEngine On

# first, www and root together
RewriteCond %{REQUEST_URI} ^/$
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/index.php [R=301,L]

# remove www
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]

# add index.php to the root url
RewriteCond %{REQUEST_URI} ^/$
RewriteRule ^(.*)$ https://%{HTTP_HOST}/index.php [R=301,L]

# finally, force https if none of the earlier conditions are met
RewriteCond %{HTTPS} off
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

The above code works well, removing all duplicates with a 301 redirect. However, I believe it could be written in a more elegant way, possibly without repeating the rewrite conditions/rules.

BTW, I found hundreds(!) of posts giving advice and examples of related .htaccess statements, and all of them are either incomplete or wrong! They usually stop after one condition is met, or don't result in a 301 in every case.


Solution

  • Firstly (as already discussed in comments), your canonical "root" URL should simply be /, not /index.php. Users should not see index.php in the URL. If the canonical URL is /index.php, you will always have to redirect users (and search engines) that type/request the root domain, and people will share URLs that include the extra index.php, which is not good for anyone. Your canonical link element for the root is then simply:

    <link rel="canonical" href="https://example.com/">
    

    And all internal links/anchors should state href="/", not href="/index.php".

    Otherwise, the only "minor" issue I see in the rules you've posted is that the root is treated differently in the www to non-www redirect, with a separate rule for it. That said, having separate rules is generally fine, provided the number of redirects is kept minimal, which appears to be the case here.

    Provided you are not implementing HSTS (in which case you would need to redirect to HTTPS first, on the same host, before any other canonicalisation redirects), you could combine these rules into one and minimise the number of redirects. For example:

    RewriteEngine On
    
    # Redirect HTTP and/or WWW and remove "index.php" (if any)
    RewriteCond %{HTTPS} off [OR]
    RewriteCond %{HTTP_HOST} ^www\. [NC,OR]
    RewriteCond %{REQUEST_URI} /index\.php$
    RewriteCond %{HTTP_HOST} ^(?:www\.)?(.*?)\.?$ [NC]
    RewriteRule ^(.*?)(?:(^|/)index\.php)?$ https://%1/$1$2 [R=301,L]
    

    I've kept the rules "generic" by not explicitly stating the domain name (but this then requires the 4th condition to get the hostname less the www subdomain). The rules could be "simplified" and would arguably be "more reliable" if the canonical hostname was hardcoded (depending on the server/environment).
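
    As a minimal sketch of the hardcoded variant, assuming example.com stands in for the real canonical hostname (a placeholder, not the asker's actual domain), the fourth condition can be dropped:

    ```apache
    RewriteEngine On

    # Sketch only: "example.com" is a placeholder for the canonical hostname
    # Redirect HTTP and/or WWW and remove "index.php" (if any)
    RewriteCond %{HTTPS} off [OR]
    RewriteCond %{HTTP_HOST} ^www\. [NC,OR]
    RewriteCond %{REQUEST_URI} /index\.php$
    RewriteRule ^(.*?)(?:(^|/)index\.php)?$ https://example.com/$1$2 [R=301,L]
    ```

    The substitution no longer uses the %1 backreference, so the hostname in the redirect target does not depend on the Host header sent by the client.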

    The regex ^(.*?)(?:(^|/)index\.php)?$ matches any URL-path, but excludes the optional index.php (or /index.php) at the end from the first capturing subpattern ($1). Note that this regex handles not only /index.php in the root, but also index.php in subdirectories, e.g. /foo/bar/index.php (which ultimately redirects to /foo/bar/). The $2 backreference simply contains the trailing slash when index.php is removed from a subdirectory; otherwise it is empty.
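
    For the HSTS case mentioned earlier, the single rule would presumably have to be split so that the first redirect stays on the requested host. A possible sketch, reusing the same conditions and regex (an illustration only, not tested against any particular server setup):

    ```apache
    RewriteEngine On

    # Step 1: HTTP to HTTPS on the same host (no other changes yet)
    RewriteCond %{HTTPS} off
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

    # Step 2: remove WWW and "index.php" (if any), now already on HTTPS
    #   foo/bar/index.php -> $1 = "foo/bar", $2 = "/" -> /foo/bar/
    #   index.php         -> $1 = "",        $2 = ""  -> /
    RewriteCond %{HTTP_HOST} ^www\. [NC,OR]
    RewriteCond %{REQUEST_URI} /index\.php$
    RewriteCond %{HTTP_HOST} ^(?:www\.)?(.*?)\.?$ [NC]
    RewriteRule ^(.*?)(?:(^|/)index\.php)?$ https://%1/$1$2 [R=301,L]
    ```

    The trade-off is an extra hop in the worst case: a request for http://www.example.com/index.php is first redirected to https://www.example.com/index.php and only then to https://example.com/.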