
Matching numbers in the URL path directly following the domain name using a rewrite rule in .htaccess


I’m trying to clean up our SEO by catching non-canonical URLs that are being indexed by Google.

Here is a sample of one of our non-canonical URLs:

https://www.umpqua.edu/184-about-ucc/facts-visitor-info?start=1 

I can catch it with the RegEx below in the .htaccess file, but it also blocks other URLs that I want to keep working. It matches any URL containing /NUMBER-, where the number is two or three digits long.

/([0-9]{2,3})-

So I'm trying to make it more specific. I have tried the pattern below without success. My hope is to catch only URLs containing edu/NUMBER-

(edu)/([0-9]{2,3})-

I have also tried

(edu/)([0-9]{2,3})-

Here is my full .htaccess entry:

RewriteCond %{REQUEST_URI} ^(edu)/([0-9]{2,3})-$
RewriteRule .* index.php [G]

Solution

  • Adding "edu" is just me trying to make the RegEx more selective. When I was using the expression /([0-9]{2,3})- it worked well, except that it also matched this URL: /component/weblinks/weblink/239-external-links/… which it should not have.

    The significant thing about edu is that it comes immediately before the start of the URL-path. However, it is not part of the URL-path; it is the end of the hostname sent in the Host header. So rather than trying to match edu, just anchor the regex to the start of the URL-path. For example:

    RewriteRule ^\d{2,3}- - [G]
    

    This needs to go near the top of the root .htaccess file.

    \d is just short for [0-9]. Note there are 3 arguments in the above directive, separated by spaces:

    1. ^\d{2,3}- ... The pattern that matches against the URL-path
    2. - ... The substitution string (in this case a single hyphen)
    3. [G] ... The flags. In this case G for gone (short for R=410).

    The above will serve a "410 Gone" for any URL-path that starts with 2 or 3 digits followed by a hyphen. The single hyphen in the substitution string explicitly indicates "no substitution". Using index.php as the substitution (as in the question) is superfluous, since the substitution is ignored when the G flag is used.
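The "2 or 3 digits" boundary can be checked outside Apache with a small Python sketch. This is only an illustration: Apache uses PCRE rather than Python's `re` module, although a pattern this simple should behave the same in both. The sample paths are hypothetical apart from the one taken from the question, and they are written without a leading slash, as the pattern would see them in a root .htaccess file.

```python
import re

# Pattern from the answer; matched against the path without a leading slash.
PATTERN = re.compile(r"^\d{2,3}-")

# Hypothetical URL-paths (only the third one comes from the question).
samples = [
    "7-foo",                             # 1 digit: too short
    "42-foo",                            # 2 digits
    "184-about-ucc/facts-visitor-info",  # 3 digits
    "1234-foo",                          # 4 digits: too long
    "about/184-foo",                     # digits not at the start
]

matches = {path: bool(PATTERN.search(path)) for path in samples}
for path, hit in matches.items():
    print(f"{path!r}: {'410 Gone' if hit else 'passes through'}")
```

Only the 2-digit and 3-digit paths are reported as "410 Gone"; a fourth digit prevents a match because the hyphen must follow the digits immediately.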

    Note that there is no slash prefix on the URL-path matched by the RewriteRule pattern when used in .htaccess.
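To see why the `edu` variants in the question never matched, it may help to split the sample URL into its parts. The following Python sketch is an approximation, not how Apache does it internally: `request_uri` and `rule_path` are illustrative names for what `%{REQUEST_URI}` and the RewriteRule pattern would be compared against, assuming the rule lives in the root .htaccess file.

```python
import re
from urllib.parse import urlsplit

url = "https://www.umpqua.edu/184-about-ucc/facts-visitor-info?start=1"
parts = urlsplit(url)

host = parts.netloc        # "www.umpqua.edu", sent in the Host header
request_uri = parts.path   # "/184-about-ucc/facts-visitor-info"
query = parts.query        # "start=1", not part of the URL-path

# Assumption: in the root .htaccess the directory prefix is "/", so the
# RewriteRule pattern sees the path without its leading slash.
rule_path = request_uri[1:]

# The condition from the question: "edu" is not in the path, and "-$"
# would require the path to end right after the hyphen.
asker_cond = re.search(r"^(edu)/([0-9]{2,3})-$", request_uri)

# The anchored pattern from the answer, with and without the slash.
with_slash = re.search(r"^\d{2,3}-", request_uri)
without_slash = re.search(r"^\d{2,3}-", rule_path)

print(host, request_uri, query)
print(bool(asker_cond), bool(with_slash), bool(without_slash))
```

Under these assumptions only the last test matches: the hostname never reaches the pattern, and the leading slash has to be absent for `^\d{2,3}-` to match.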

    You do not need a separate condition (RewriteCond directive) - the comparison can more easily/efficiently be performed in the RewriteRule directive itself.

    So the above will block /184-about-ucc/facts-visitor-info?start=1 but not /component/weblinks/weblink/239-external-links/..., since the 3 digits in the second URL do not occur at the start of the URL-path.
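The difference between the two URLs can be reproduced with a short Python sketch that compares the anchored pattern with the unanchored one from the question. `rule_input` is a hypothetical helper that mimics the removal of the leading slash in a root .htaccess file, and Python's `re` module stands in for Apache's PCRE engine. The second path is left truncated, as in the question.

```python
import re

ANCHORED = re.compile(r"^\d{2,3}-")          # pattern from the answer
UNANCHORED = re.compile(r"/([0-9]{2,3})-")   # the asker's first attempt

def rule_input(url_path: str) -> str:
    """Hypothetical helper: drop the leading slash, as a root .htaccess does."""
    return url_path[1:] if url_path.startswith("/") else url_path

paths = [
    "/184-about-ucc/facts-visitor-info",
    "/component/weblinks/weblink/239-external-links/...",
]

results = {}
for p in paths:
    results[p] = (
        bool(ANCHORED.search(rule_input(p))),  # would the answer's rule fire?
        bool(UNANCHORED.search(p)),            # would the original regex fire?
    )
    print(p, results[p])
```

The unanchored regex hits both paths, which is the over-matching described in the question, while the anchored pattern hits only the first.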