.htaccess redirect to remove repeated strings with query


Assuming I have the following urls:

https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html.html.html

https://mywebsite.com/pages.html.html.html?limit=24&start=8136&t=3358

https://mywebsite.com/pages.html.html?limit=24&start=8136&t=3358.html.html.html.html

How can I get rid of the repeated ".html" parts, leaving only one?

It's a mixed case, and after two hours of struggling, I still can't find a regex that makes this work.

Here's what I tried:

RewriteEngine On
RewriteBase /
RewriteCond %{QUERY_STRING} ^(.*)((.html){2,})(.*)$
RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L]

I'm using a 302 because this is a temporary workaround until I fix the root cause of the issue, but I keep getting a redirect loop.

I'd like to leave only ONE .html in place, removing all the multiple occurrences of it.

Example 1:

https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html.html.html

should redirect to:

https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html

and

https://mywebsite.com/pages.html.html?limit=24&start=8136&t=3358.html.html.html.html

should redirect to:

https://mywebsite.com/pages.html?limit=24&start=8136&t=3358.html

Sorry to ask for this, but it was particularly tricky for me and I couldn't find a solution.

Thanks in advance.


Solution

  • RewriteCond %{QUERY_STRING} ^(.*)((.html){2,})(.*)$
    RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L]
    

    There are a few problems here:

    • You are discarding the original URL-path (i.e. /pages.html).
    • You are moving the corrected query string (i.e. %1%4) into the URL-path, not the query string. It should perhaps be ?%1%4.
    • The original "incorrect" query string is then appended again (regardless of the QSA flag), because the substitution contains no query string of its own. This ultimately causes the redirect loop.
    • This would fail if .html was only repeated once in the query string (i.e. .html.html), since both copies would be consumed by the {2,} group and none would be left in %1. (Could that happen, as it does appear to in the URL-path?)
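
    To make the loop concrete, here is a rough hand trace of the original rule against the first example URL, written as comments. It assumes the .htaccess file sits in the document root, and it is a sketch of how mod_rewrite should behave, not captured server output:

    ```apache
    # Request: /pages.html?limit=24&start=7440&t=3349.html.html.html
    #
    # RewriteCond %{QUERY_STRING} ^(.*)((.html){2,})(.*)$
    #   %1 = limit=24&start=7440&t=3349.html   (greedy, so it keeps the first ".html")
    #   %2 = .html.html
    #   %4 = (empty)
    #
    # RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L]
    #   $1 = pages.html is captured but never used in the substitution.
    #   The substitution has no "?", so the original query string is passed through.
    #
    # Location (roughly):
    #   https://mywebsite.com/limit=24&start=7440&t=3349.html?limit=24&start=7440&t=3349.html.html.html
    #
    # The redirected request still carries the same query string, so the
    # condition matches again and the same redirect is issued: a loop.
    ```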

    In your examples, the repeated .html always appears at the end of the URL-path and/or the end of the query string. So the trailing (.*) in your regex would seem to be unnecessary, since nothing occurs after the duplicated .html sequence.

    Try the following instead:

    # Multiple ".html" at end of query string
    # (Also resolves multiple ".html" at end of URL-path - if any)
    RewriteCond %{QUERY_STRING} (.+?\.html)(\.html)+$
    RewriteRule (.+?\.html)(\.html)*$ /$1?%1 [NE,R,L]
    
    # Multiple ".html" at end of the URL-path only
    # (Query string + URL-path already handled by the above rule.)
    RewriteRule (.+?\.html)(\.html)+$ /$1 [R,L]
    

    With these two rules there is at most one redirect. The first rule handles erroneous multiple .html in the query string and corrects the URL-path at the same time. The second rule handles the URL-path only (when the query string is already correct).
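
    Applied to the three URLs from the question, the rules should behave as follows. This is a hand trace, assuming the .htaccess file is in the document root and no other rules interfere; it is not captured output:

    ```apache
    # Rule 1 (query string ends in repeated ".html"; URL-path fixed in the same redirect)
    #   /pages.html?limit=24&start=7440&t=3349.html.html.html
    #     -> /pages.html?limit=24&start=7440&t=3349.html
    #   /pages.html.html?limit=24&start=8136&t=3358.html.html.html.html
    #     -> /pages.html?limit=24&start=8136&t=3358.html
    #
    # Rule 2 (query string already correct; only the URL-path is repeated)
    #   /pages.html.html.html?limit=24&start=8136&t=3358
    #     -> /pages.html?limit=24&start=8136&t=3358
    ```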

    Note that the trailing ? in .+? (part of (.+?\.html)(\.html)*$) makes the preceding quantifier non-greedy, so it consumes as little as possible. In other words, the first capturing group captures only one instance of .html. A greedy .+ would instead capture everything in the first rule, or everything except the last instance of .html in the second rule.
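
    As an illustration of the greedy versus non-greedy difference, here is what the first capturing group should hold for a URL-path with three .html suffixes. The two greedy patterns are hypothetical variants shown only for comparison; they are not part of the rules above:

    ```apache
    # URL-path being matched: pages.html.html.html
    #
    # (.+?\.html)(\.html)*$   non-greedy, as used above   -> $1 = pages.html
    # (.+\.html)(\.html)*$    greedy, optional suffix     -> $1 = pages.html.html.html
    # (.+\.html)(\.html)+$    greedy, mandatory suffix    -> $1 = pages.html.html
    ```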

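    Note the subtle difference in the regex between the first and second rules: (.+?\.html)(\.html)*$ and (.+?\.html)(\.html)+$ respectively. In the first (*), the additional trailing .html is optional, because that rule is triggered by the query string and must also match a URL-path that is already correct. In the second (+), it is mandatory, so the rule only fires when the URL-path actually needs fixing.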

    The QSA flag is not required on either rule. In the first rule the substitution supplies its own query string, so the original query string is discarded by default. In the second rule the query string (already OK) is passed through by default.

    The NE flag is used in the first rule since the backreference (as captured from the QUERY_STRING server variable) is already URL encoded.

    A single R flag defaults to 302 (temporary), although for readability it can be beneficial to be explicit.
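
    With the status code spelled out, the complete block might look like the sketch below. It is the same two rules as above with R written as R=302; RewriteEngine On is carried over from the question, and the file is assumed to be the .htaccess in the document root:

    ```apache
    RewriteEngine On

    # Multiple ".html" at end of query string
    # (also fixes multiple ".html" at end of URL-path, if any)
    RewriteCond %{QUERY_STRING} (.+?\.html)(\.html)+$
    RewriteRule (.+?\.html)(\.html)*$ /$1?%1 [NE,R=302,L]

    # Multiple ".html" at end of the URL-path only
    RewriteRule (.+?\.html)(\.html)+$ /$1 [R=302,L]
    ```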