Assume I have the following URLs:
https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html.html.html
https://mywebsite.com/pages.html.html.html?limit=24&start=8136&t=3358
https://mywebsite.com/pages.html.html?limit=24&start=8136&t=3358.html.html.html.html
How can I get rid of the repeated ".html" parts, leaving only one?
It's a mixed case, and after two hours of struggling, I still can't find a proper regex to get this to work.
Here's what I tried:
RewriteEngine On
RewriteBase /
RewriteCond %{QUERY_STRING} ^(.*)((.html){2,})(.*)$
RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L]
I'm using 302 as it's a temporary workaround until I have a working solution to the root of this issue, but I keep getting a redirect loop.
I'd like to leave only ONE .html in place, removing all the multiple occurrences of it.
Example 1:
https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html.html.html
should redirect to:
https://mywebsite.com/pages.html?limit=24&start=7440&t=3349.html
and
https://mywebsite.com/pages.html.html?limit=24&start=8136&t=3358.html.html.html.html
should redirect to:
https://mywebsite.com/pages.html?limit=24&start=8136&t=3358.html
Sorry to ask for this, but it was particularly tricky for me and I couldn't find a solution.
Thanks in advance.
RewriteCond %{QUERY_STRING} ^(.*)((.html){2,})(.*)$
RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L]
There are a few problems here:
- The original URL-path (e.g. "/pages.html") is discarded.
- The rewritten query string ("%1%4") is injected into the URL-path (not the query string). It should perhaps be "?%1%4".
- The original query string is appended back onto the redirect target ("QSA" flag). This ultimately causes the redirect loop.
- It would not leave a single ".html" in place if ".html" was only repeated once in the query string. (Could that happen, as it appears to in the URL-path?)
In your examples the multiple ".html" always appears at the end of the URL-path and/or the end of the query string. So the trailing "(.*)" in your regex would seem to be unnecessary (since nothing occurs after the duplicated ".html" sequence).
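To make the loop concrete, here is a minimal sketch that traces what the original rule produces for the first example URL. It uses Python's "re" module as a stand-in for Apache's regex engine (an assumption; this particular pattern behaves the same in both), and the helper name original_rule is hypothetical.

```python
import re

# The condition from the question, unchanged (note the unescaped dots).
ORIGINAL_COND = re.compile(r'^(.*)((.html){2,})(.*)$')

def original_rule(query):
    """Mimic: RewriteRule ^(.*)$ https://mywebsite.com/%1%4 [QSA,R=302,L].

    Returns (new_path, new_query), or None when the condition does not match.
    """
    m = ORIGINAL_COND.search(query)
    if m is None:
        return None
    # %1%4 lands in the URL-path because the substitution has no "?".
    new_path = '/' + m.group(1) + m.group(4)
    # The original query string is carried over unchanged.
    return new_path, query

first = original_rule('limit=24&start=7440&t=3349.html.html.html')
print(first)

# The query string still has the duplicated ".html", so the rule fires again.
second = original_rule(first[1])
print(second is not None)
```

The redirect target keeps the broken query string, so every follow-up request matches the same condition again.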
Try the following instead:
# Multiple ".html" at end of query string
# (Also resolves multiple ".html" at end of URL-path - if any)
RewriteCond %{QUERY_STRING} (.+?\.html)(\.html)+$
RewriteRule (.+?\.html)(\.html)*$ /$1?%1 [NE,R,L]
# Multiple ".html" at end of the URL-path only
# (Query string + URL-path already handled by the above rule.)
RewriteRule (.+?\.html)(\.html)+$ /$1 [R,L]
With these two rules there is at most one redirect. The first rule handles erroneous multiple ".html" in the query string, but also corrects the URL-path at the same time. The second rule handles the URL-path only (when the query string is already correct).
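As a rough check of the "at most one redirect" claim, the two rules can be simulated outside Apache. This is only a sketch: Python's "re" stands in for Apache's regex engine, the function name apply_rules is hypothetical, and the URL-path is passed without its leading slash, as mod_rewrite does in a root .htaccess file.

```python
import re

QUERY_DUP = re.compile(r'(.+?\.html)(\.html)+$')  # RewriteCond of rule 1
PATH_ANY = re.compile(r'(.+?\.html)(\.html)*$')   # RewriteRule pattern of rule 1
PATH_DUP = re.compile(r'(.+?\.html)(\.html)+$')   # RewriteRule pattern of rule 2

def apply_rules(path, query):
    """Return the redirect target (path, query), or None if no rule fires."""
    # Rule 1: fix the query string, and the URL-path at the same time.
    p = PATH_ANY.search(path)
    q = QUERY_DUP.search(query)
    if p and q:
        return '/' + p.group(1), q.group(1)
    # Rule 2: fix the URL-path only; the query string passes through.
    p = PATH_DUP.search(path)
    if p:
        return '/' + p.group(1), query
    return None

examples = [
    ('pages.html', 'limit=24&start=7440&t=3349.html.html.html'),
    ('pages.html.html.html', 'limit=24&start=8136&t=3358'),
    ('pages.html.html', 'limit=24&start=8136&t=3358.html.html.html.html'),
]

for path, query in examples:
    new_path, new_query = apply_rules(path, query)
    # A second pass over the redirect target should find nothing left to fix.
    again = apply_rules(new_path.lstrip('/'), new_query)
    print(new_path, new_query, again)
```

Each of the three example URLs from the question is cleaned in a single pass, and the second pass returns None, so no further redirect is issued.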
Note that the trailing "?" in ".+?" (part of "(.+?\.html)(\.html)*$") makes the preceding quantifier non-greedy, so we consume as little as possible. In other words, the first capturing group consumes only one instance of ".html". A greedy quantifier would instead consume everything in the first rule (where the trailing ".html" is optional), or everything except the last instance of ".html" in the second rule.
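The effect of the non-greedy quantifier can be seen by comparing it with greedy variants of the same patterns. This is a small illustration using Python's "re" module (an assumption; lazy and greedy quantifiers behave the same way in Apache's regex engine), and the greedy patterns are shown only for contrast, they are not part of the rules above.

```python
import re

path = 'pages.html.html.html'

# Non-greedy, as used in the rules: stop at the first ".html".
lazy = re.search(r'(.+?\.html)(\.html)*$', path).group(1)

# Greedy variant of the first rule's pattern: group 1 swallows everything.
greedy_star = re.search(r'(.+\.html)(\.html)*$', path).group(1)

# Greedy variant of the second rule's pattern: everything but the last ".html".
greedy_plus = re.search(r'(.+\.html)(\.html)+$', path).group(1)

print(lazy)         # pages.html
print(greedy_star)  # pages.html.html.html
print(greedy_plus)  # pages.html.html
```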
There is just a subtle difference in the regex between the first and second rules: "(.+?\.html)(\.html)*$" and "(.+?\.html)(\.html)+$" respectively. In the first ("*"), the additional trailing ".html" is optional, but in the second ("+") it is mandatory.
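A short sketch of why that difference matters, again using Python's "re" as a stand-in for Apache's regex engine (an assumption). The optional form still matches an already-clean URL-path, which lets the first rule fire when only the query string is broken. The mandatory form does not, which keeps the second rule from redirecting a clean URL.

```python
import re

STAR = re.compile(r'(.+?\.html)(\.html)*$')  # first rule's URL-path pattern
PLUS = re.compile(r'(.+?\.html)(\.html)+$')  # second rule's URL-path pattern

clean, dirty = 'pages.html', 'pages.html.html'

star_on_clean = STAR.search(clean) is not None  # True: trailing part optional
plus_on_clean = PLUS.search(clean) is not None  # False: needs a duplicate
star_on_dirty = STAR.search(dirty).group(1)     # pages.html
plus_on_dirty = PLUS.search(dirty).group(1)     # pages.html

print(star_on_clean, plus_on_clean, star_on_dirty, plus_on_dirty)
```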
The QSA flag is not required on either rule. In the first rule we rebuild the query string, so the original query string is discarded (by default). In the second rule the query string (already OK) is passed through by default.
The NE flag is used in the first rule since the backreference (as captured from the QUERY_STRING server variable) is already URL-encoded.
A bare R flag defaults to 302 (temporary), although for readability it can be beneficial to be explicit (R=302).