regex url-rewriting ecma262

Regex not capturing repeating optional captures


I'm trying to write a URL rewrite regex for my company's site. The URL will always start with category/.+, and after that there can be up to 5 extra tags added on. My current regex always captures the .+ after category, but it also pulls everything that follows into that same capture group. My regex and example data:

/category\/(.+)(?:\/(?:page|price|shipping|sort|brand)\/(.*))*/
mysite.com/category/15000000
mysite.com/category/15000000/page/2
mysite.com/category/15000000/page/2/price/g10l20
mysite.com/category/60000000/page/2/price//shipping//brand//sort/

The outcome is always

$1 = 15000000
    // desired $1 = 15000000
$1 = 15000000/page/2
    // desired $1 = 15000000 $2 = 2
$1 = 15000000/page/2/price/g10l20
    // desired $1 = 15000000 $2 = 2 $3 = g10l20
$1 = 60000000/page/2/price//shipping//brand//sort/
    // desired $1 = 60000000 $2 = 2 $3 = "" $4 = "" $5 = "" $6 = ""

My understanding is that the zero-or-more quantifier would let it go back and search again for the "flag" pattern, but this is apparently not the case. Could someone please tell me what I'm doing wrong?


Solution

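  • Unfortunately, an ECMAScript regex can't keep an indeterminate number of captures. When a capturing group is repeated with +, *, {n}, etc., only the text from its last iteration is kept. Your pattern also has a second problem: (.+) is greedy and . matches / as well, so it consumes the rest of the URL and the optional tag group matches zero times.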
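
    The sketch below checks this in JavaScript. It also runs the question's own pattern against one of the sample URLs to show what each group ends up holding. The variable names are only for illustration.

    ```javascript
    // A repeated capturing group keeps only its final iteration.
    const repeated = /^(?:\/(\w+))*$/.exec("/page/price/brand");
    console.log(repeated[1]); // "brand"

    // The question's pattern: (.+) is greedy and "." also matches "/",
    // so it takes the whole tail and the trailing group matches zero times.
    const greedy = /category\/(.+)(?:\/(?:page|price|shipping|sort|brand)\/(.*))*/
      .exec("mysite.com/category/15000000/page/2");
    console.log(greedy[1]); // "15000000/page/2"
    console.log(greedy[2]); // undefined
    ```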

    Since you know there will be at most 5 tags, you could just repeat the relevant block 5 times, like so:

    /category\/([^/]*)(?:\/(page|price|shipping|sort|brand)\/([^/]*))?(?:\/(page|price|shipping|sort|brand)\/([^/]*))?(?:\/(page|price|shipping|sort|brand)\/([^/]*))?(?:\/(page|price|shipping|sort|brand)\/([^/]*))?(?:\/(page|price|shipping|sort|brand)\/([^/]*))?/
    

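    This is ugly in the extreme, allows a tag to be repeated, and needs the regular expression to be extended if you want to add more tags. Note also that each block captures both the tag name and its value, so the values land in $3, $5, $7, $9 and $11 rather than $2 to $6.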

    The neatest solution is probably to capture the category ID in $1 and the rest of the argument string in $2. The application then has to parse $2, which it can do far more neatly than a regex can.

    /category\/([^/]*)(\/.*)?/
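
    As a rough sketch of that application-side parsing, assuming the tags always arrive as name/value pairs in the path: the helper name parseCategoryUrl and the KNOWN_TAGS list are hypothetical, and this version silently skips unknown tag names and lets a repeated tag overwrite the earlier value.

    ```javascript
    // Hypothetical helper: splits a category URL into its ID and a tag map.
    const KNOWN_TAGS = ["page", "price", "shipping", "sort", "brand"];

    function parseCategoryUrl(url) {
      const match = /category\/([^/]*)(\/.*)?/.exec(url);
      if (!match) return null;

      const tags = {};
      // match[2] is undefined when nothing follows the category ID.
      const parts = (match[2] || "").split("/").slice(1);
      // Walk the segments as name/value pairs; a missing value becomes "".
      for (let i = 0; i < parts.length; i += 2) {
        const name = parts[i];
        if (KNOWN_TAGS.includes(name)) {
          tags[name] = parts[i + 1] || "";
        }
      }
      return { category: match[1], tags };
    }

    console.log(parseCategoryUrl("mysite.com/category/15000000/page/2/price/g10l20"));
    // { category: "15000000", tags: { page: "2", price: "g10l20" } }
    ```

    Because the result is keyed by tag name, the order of the tags in the URL no longer matters, and adding a new tag only means extending the list.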