apache, .htaccess, mod-rewrite

Remove .html and .html/amp extension with .htaccess only for files from directory


After a site move, I want to remove the extension (if any) and the query string (if any), leaving just the file name while keeping the path:

https://www.example.com/blog/anyfile.html 301 to >> https://www.example.com/blog/anyfile

https://example.com/blog/anyfile.html/amp 301 to >> https://www.example.com/blog/anyfile

https://www.example.com/blog/anyfile.html/amp?nonamp=1 301 to >> https://www.example.com/blog/anyfile

I tried something like this, but it doesn't keep the /blog/ folder:

RewriteEngine On
RewriteCond %{REQUEST_URI} ^/blog/
RewriteRule ^.*/([^/]+)\.html$ /$1? [L,NC,R]

Also, I can't find a way to remove /amp after .html.


Solution

  • Near the top of the root .htaccess file, you could do something like the following to discard .html, .html/amp and .html/<anything> from the end of the URL-path, and discard the query string (if any) at the same time:

    # Strip ".html" onwards from the end of the URL (and remove query string)
    RewriteRule ^(.*)\.html(/.*)?$ https://www.example.com/$1 [QSD,R=301,L]
    

    On Apache 2.4+, the QSD (Query String Discard) flag is preferable to appending an empty query string (the trailing ? on the substitution, as in your rule) in order to remove the query string.

    You need to hardcode the scheme and hostname if you wish to satisfy your second example and redirect from example.com to www.example.com. This could be generalised (without hardcoding the domain) if we know that your site is accessible only via the www subdomain or the domain apex of this single domain.
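
    As a rough sketch of that generalisation, assuming the site answers only on a single domain (apex and www) and that the canonical host is always the www subdomain, the hostname could be taken from the request instead of being hardcoded. The condition and its %1 backreference are an illustration, not something taken from your existing configuration:

    # Sketch only: assumes one domain whose canonical host is www.<domain>
    # %1 holds the requested hostname with any leading "www." removed
    RewriteCond %{HTTP_HOST} ^(?:www\.)?(.+?)\.?$ [NC]
    RewriteRule ^(.*)\.html(/.*)?$ https://www.%1/$1 [QSD,R=301,L]
    

    This would be used in place of the rule with the hardcoded hostname, not in addition to it.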

    However, the above won't catch URLs that include a query string but don't contain .html in the URL-path. For that, you could implement an additional rule, following the rule above:

    # Strip the query string from any URL.
    RewriteCond %{QUERY_STRING} .
    RewriteRule ^ https://www.example.com%{REQUEST_URI} [QSD,R=301,L]
    

    A look at your existing rule:

    RewriteCond %{REQUEST_URI} ^/blog/
    RewriteRule ^.*/([^/]+)\.html$ /$1? [L,NC,R]
    

    You are only capturing the filename (anyfile in your example) and discarding the URL-path that precedes it (i.e. blog/), so the $1 backreference contains only anyfile. The pattern also matches only URLs that end in .html, not .html/amp. In addition, the R flag without a status code results in a 302 (temporary) redirect, not a 301.

    Checking the URL-path in the RewriteCond directive is superfluous, since the RewriteRule pattern can match the directory prefix itself.
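
    If the redirect should apply only to files under /blog/ (as the question title suggests), the directory can be included in the RewriteRule pattern. A minimal sketch, assuming the rule lives in the root .htaccess file, where the pattern is matched against the URL-path without a leading slash:

    # Sketch only: strip ".html" onwards for URLs under /blog/ (and remove query string)
    RewriteRule ^(blog/.+)\.html(/.*)?$ https://www.example.com/$1 [QSD,R=301,L]
    

    Here $1 contains blog/anyfile for a request to /blog/anyfile.html/amp, so the /blog/ folder is kept in the redirect target.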