Splitting URL in three parts in htaccess - regex


I'm trying to split any URL that would show up on my website into three parts:

  1. Language (optional)
  2. Hierarchical structure of the page (parents)
  3. Current page

Right now I only use parts 1 and 3, but I need a way to let pages share the same name when they have different parents, so that the full URL is still unique.

Here are the types of URL I may have:

(nothing)
en
en/test
en/parent/test
test
parent/test
ggparent/gparent/parent/test

I thought about extending my current directive:

RewriteRule ^(?:([a-z]{2})(?=\/))?.*(?:\/([\w\-\,\+]+))$ /index.php?lang=$1&page=$2 [L,NC]

to the following:

(?:([a-z]{2})(?=\/))?(.*)\/([^\/]*)?$

I could then map that to index.php?lang=$1&tree=$2&page=$3, but the difficulty is that the second capturing group also captures the leading slash.

Based on my search so far, I believe I can't dynamically return every string between the slashes, with the last one always placed first, without repeating the same regex. So I thought I would capture everything between the language and the current page as a single group and process the tree in PHP.

However my current regex has some problems and I can't figure them out:

  1. If the language is on its own, it doesn't get captured
  2. The second group captures the slash between the language and the tree

Link to Regex101: https://regex101.com/r/ecHBQT/1
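
The two problems listed above can be reproduced outside of Apache. The following is a minimal sketch in JavaScript (chosen only for convenience; the variable names are illustrative) that runs the extended pattern against two of the sample URLs:

```javascript
// The extended pattern from the question, unchanged.
const attempt = /(?:([a-z]{2})(?=\/))?(.*)\/([^\/]*)?$/;

// Problem 1: a bare language code contains no slash, so the mandatory "\/"
// in the pattern can never match and the whole pattern fails.
console.log(attempt.exec('en')); // null

// Problem 2: the lookahead (?=\/) only checks for the slash without
// consuming it, so the slash is left over for the second group.
const [, lang, tree, page] = attempt.exec('en/parent/test');
console.log({ lang, tree, page }); // tree is "/parent"
```

Both symptoms come from the same place: the slash after the language is asserted but never consumed, and the slash before the page is mandatory rather than optional.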


Solution

  • This likely does it: split the URL at the slashes into lang, tree, and page, any of which may be empty:

    RewriteRule ^([a-z]{2}\b)?\/?(?:\/?(.+)\/)?(.*)$ /index.php?lang=$1&tree=$2&page=$3 [L,NC]
    

    Test case in JavaScript using this regex:

    const regex = /^([a-z]{2}\b)?\/?(?:\/?(.+)\/)?(.*)$/;
    [
      '',
      'en',
      'en/test',
      'en/parent/test',
      'test',
      'parent/test',
      'ggparent/gparent/parent/test'
    ].forEach(str => {
      let rewritten = str.replace(regex, '/index.php?lang=$1&tree=$2&page=$3');
      console.log('"' + str + '" ==>', rewritten);
    })

    Output:

    "" ==> /index.php?lang=&tree=&page=
    "en" ==> /index.php?lang=en&tree=&page=
    "en/test" ==> /index.php?lang=en&tree=&page=test
    "en/parent/test" ==> /index.php?lang=en&tree=parent&page=test
    "test" ==> /index.php?lang=&tree=&page=test
    "parent/test" ==> /index.php?lang=&tree=parent&page=test
    "ggparent/gparent/parent/test" ==> /index.php?lang=&tree=ggparent/gparent/parent&page=test
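
Since the tree arrives as a single slash-separated string, the application still has to break it into individual ancestors. The question plans to do this in PHP; the sketch below shows the same idea in JavaScript to stay consistent with the test case. The helper name `splitRoute` and the shape of the returned object are assumptions made for illustration, not part of the original answer.

```javascript
// Same pattern as in the RewriteRule; the "i" flag mirrors the NC flag.
const routePattern = /^([a-z]{2}\b)?\/?(?:\/?(.+)\/)?(.*)$/i;

// Hypothetical helper: splits a request path into language, ancestors, page.
function splitRoute(path) {
  const match = routePattern.exec(path);
  if (!match) return null; // e.g. the path contains a line break

  // Groups that did not participate in the match are undefined,
  // so default them to empty strings.
  const [, lang = '', tree = '', page = ''] = match;

  // Turn "ggparent/gparent/parent" into an array, dropping empty segments.
  const parents = tree.split('/').filter(Boolean);
  return { lang, parents, page };
}

console.log(splitRoute('en/parent/test'));
console.log(splitRoute('ggparent/gparent/parent/test'));
```

Under these assumptions, the first call yields the language `en`, one parent (`parent`), and the page `test`; the second yields an empty language and three parents in top-down order.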
    

    Notes: