regex, apache, .htaccess, mod-rewrite, url-rewriting

Regex rewrite fix


I have the following 2 rewrite rules in my .htaccess file:

# Rule 1
RewriteRule ^(AL|AK|AZ|AR|CA|CO)/?([a-zA-Z-]+)?/?(faq|tagged)?/?([a-zA-Z0-9-]+)?$ /pages/seo.select-page.php?state=$1&location=$2&page=$3&title=$4 [NC,L]

# Rule 2
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^/.]+)/?$ /pages/seo.$1.php [L]

Overall it works fine, until my second rule needs to point to a page whose name starts with the same two letters as one of the state codes in Rule 1.

For example:

mydomain.co/copyright

in which case it defaults to:

mydomain.co/CO 

I've tried adding char count but it didn't seem to work.

(AL|AK|AZ|AR|CA|CO){2}

How can I fix this?


Solution

  • RewriteRule ^(AL|AK|AZ|AR|CA|CO)/?([a-zA-Z-]+)?/?(faq|tagged)?/?([a-zA-Z0-9-]+)?$ /pages/seo.select-page.php?state=$1&location=$2&page=$3&title=$4 [NC,L]
    

    To avoid the conflict with your example URL (mydomain.co/copyright), you could simply remove the NC flag from the first rule so that the match is case-sensitive, specifically for the two uppercase characters of the state code. The 2nd and 4th path segments ("location" and "title") already match both a-z and A-Z, so they are unaffected. Dropping NC would only be an issue if faq or tagged could be requested as anything other than all lowercase, or if the state code itself could be requested in lowercase.
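
    As a minimal sketch of that first option, the rule below is your original Rule 1 with only the NC flag dropped; nothing else is changed:

    ```apache
    # Rule 1 without NC: the state code must now be uppercase.
    # "copyright" no longer matches, because "co" is not "CO".
    # Note: a request such as "COpyright" would still match (state=CO, location=pyright),
    # and lowercase requests such as "co/denver" would stop matching.
    RewriteRule ^(AL|AK|AZ|AR|CA|CO)/?([a-zA-Z-]+)?/?(faq|tagged)?/?([a-zA-Z0-9-]+)?$ /pages/seo.select-page.php?state=$1&location=$2&page=$3&title=$4 [L]
    ```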

    However, this rule is too permissive: every path segment after the state code is optional, and so is every slash (path delimiter)*1. Segments can therefore be omitted in any combination and run together without a delimiter, which results in an invalid rewrite (as in your example).

    (*1 It is the optional delimiters that allow /copyright to be successfully matched.)
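
    To see why, it may help to trace the original pattern (with NC) against the requested path step by step. This is only an annotated walk-through of how the regex engine would consume "copyright", written as comments:

    ```apache
    # Request: /copyright  (in .htaccess the pattern is tested against "copyright")
    #
    # ^(AL|AK|AZ|AR|CA|CO)   matches "co"       (NC makes "CO" match "co")
    # /?                     matches nothing    (the slash is optional)
    # ([a-zA-Z-]+)?          matches "pyright"  (taken as the "location")
    # /?(faq|tagged)?        matches nothing
    # /?([a-zA-Z0-9-]+)?$    matches nothing, then the end of the string
    #
    # Resulting rewrite:
    #   /pages/seo.select-page.php?state=co&location=pyright&page=&title=
    ```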

    To keep the 2nd, 3rd and 4th path segments optional, but only in that order and with the slash delimiters enforced, you could adjust the regex to use a series of nested non-capturing groups:

    RewriteRule ^(AL|AK|AZ|AR|CA|CO)(?:/([a-zA-Z-]+)(?:/(faq|tagged)(?:/([a-zA-Z0-9-]+))?)?)?$ /pages/seo.select-page.php?state=$1&location=$2&page=$3&title=$4 [L]
    

    This now only permits a URL of the form /state or /state/location or /state/location/page or /state/location/page/title. (One caveat: the patterns for "location" and "title" are nearly identical, so the rule can only tell them apart by position, not by content. The "location" pattern also matches faq and tagged, so /CO/faq is rewritten with location=faq rather than page=faq.)
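
    To make the accepted forms concrete, here is roughly how the nested rule above would treat a few requests. The values denver and some-title are made up for illustration; groups that do not take part in the match expand to an empty string in the substitution:

    ```apache
    # Requested path              Query string passed to seo.select-page.php
    # CO                          state=CO&location=&page=&title=
    # CO/denver                   state=CO&location=denver&page=&title=
    # CO/denver/faq               state=CO&location=denver&page=faq&title=
    # CO/denver/faq/some-title    state=CO&location=denver&page=faq&title=some-title
    # CO/faq                      state=CO&location=faq&page=&title=   (the caveat noted above)
    # copyright                   no match, so the request falls through to Rule 2
    # CO/denver/                  no match (trailing slash)
    ```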

    This also avoids the conflict with a URL like /copyright, with or without the NC flag, since the slash after the two-character "state" code is now mandatory whenever a "location" is present.

    However, as written, this modified rule does not allow a URL that ends in a trailing slash, which your earlier rule would permit.
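
    If you do need to keep accepting a trailing slash, one possible variation (assuming you want such URLs rewritten in place rather than redirected to the slash-less form) is to add a single optional slash just before the end anchor:

    ```apache
    # Nested pattern plus an optional trailing slash before "$".
    # "CO", "CO/", "CO/denver" and "CO/denver/" all match; "copyright" still does not,
    # because a "/" is still required between the state code and any location.
    RewriteRule ^(AL|AK|AZ|AR|CA|CO)(?:/([a-zA-Z-]+)(?:/(faq|tagged)(?:/([a-zA-Z0-9-]+))?)?)?/?$ /pages/seo.select-page.php?state=$1&location=$2&page=$3&title=$4 [L]
    ```

    Only the final /? differs from the rule shown earlier, so the captured groups and the substitution are unchanged.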


    Aside:

    I've tried adding char count but it didn't seem to work.

    (AL|AK|AZ|AR|CA|CO){2}
    

    That isn't a "char count". {2} is a quantifier that repeats the preceding group exactly twice, so here it matches two state codes back to back (four letters, such as COCA). It matches the same strings as:

    (?:AL|AK|AZ|AR|CA|CO)(AL|AK|AZ|AR|CA|CO)
    

    (A repeated capturing group keeps only its last iteration, so $1 would hold the second code.)