regex pcre netscaler

RegEx matching URLs that are NOT in my domain


I am trying to set up my Netscaler device with a Rewrite Policy. One of my requirements is to replace any non-domain URLs with the home page URL... that is, I want the Netscaler to replace all external links on a page being served from behind the device with the home page's URL (ex: https://my.domain.edu). The type of Rewrite Policy I'm trying to configure uses a PCRE-compliant regex engine to find specific text on a web page (multiple matches possible).

good links:

https://your.page.domain.edu -- won't be replaced  
http://good.domain.edu  -- also won't be replaced

bad links (should be replaced with home page URL):

https://www.google.com    
http://not.the.best.example.org   
http://another.bad.example.erewhon.edu   
https://my.domain.com    

I currently have this pattern:

(https?://)(?![\w.-]+\.domain\.edu)

According to the Netscaler's RegEx evaluation tool this matches the bad links above and doesn't match the good links, so it seems to be working... in fact, when I run this on a test page, the Netscaler finds all the URLs I want to replace and leaves the good URLs alone.

The problem is the Netscaler isn't replacing the URLs the way I want: it replaces the (https?://) group with the home page URL but leaves the remaining part of the bad URL. For example, it replaces http://www.google.com with: https://my.domain.eduwww.google.com
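
The behaviour described above can be reproduced outside the Netscaler. The following is a minimal sketch, assuming Python's `re` module treats this pattern the same way the PCRE engine does; the name `HOME` is only illustrative. Because the negative lookahead is zero-width, the match covers only the scheme, so only the scheme is replaced.

```python
import re

HOME = "https://my.domain.edu"  # illustrative name for the home page URL

# Pattern from the question. The lookahead (?!...) checks the host name
# but consumes no characters, so the match ends right after "://".
pattern = re.compile(r"(https?://)(?![\w.-]+\.domain\.edu)")

match = pattern.search("http://www.google.com")
matched_text = match.group(0)

# Only the matched text (the scheme) is substituted; the host is left behind.
replaced = pattern.sub(HOME, "http://www.google.com")

# A good link is not matched at all, so it is returned unchanged.
kept = pattern.sub(HOME, "http://good.domain.edu")

print(repr(matched_text))
print(replaced)
print(kept)
```

To replace the whole URL, the host name has to be part of the consumed match rather than sit only inside a lookahead.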

I can configure the Rewrite Policy to replace specific URLs (for example, https://www.google.com), so I know the mechanism works. Obviously, this won't work for the general case.

I've tried enclosing the entire regex in parentheses, but this didn't change anything.

Can a regular expression be written for the general case, to match the entire URL for all domains that aren't mine?

Thanks in advance for any help!


Solution

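  • You can use the following regex (with the multiline flag enabled, so that ^ and $ match at the start and end of each line of the test input):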

    ^https?:\/\/[\w.-]+(?<!\.domain\.edu)$
    

    with your home page URL as substitution:

    https://my.domain.edu
    

    TEST INPUT:

    https://www.google.com
    http://not.the.best.example.org
    http://another.bad.example.erewhon.edu
    https://my.domain.com
    https://your.page.domain.edu
    http://good.domain.edu
    

    TEST OUTPUT:

    https://my.domain.edu
    https://my.domain.edu
    https://my.domain.edu
    https://my.domain.edu
    https://your.page.domain.edu
    http://good.domain.edu
    


    If the original scheme (http or https) should be preserved, use the following regex:

    ^(https?:\/\/)[\w.-]+(?<!\.domain\.edu)$
    

    with replacement:

    \1my.domain.edu
    

    INPUT:

    https://www.google.com
    http://not.the.best.example.org
    http://another.bad.example.erewhon.edu
    https://my.domain.com
    https://your.page.domain.edu
    http://good.domain.edu
    

    OUTPUT:

    https://my.domain.edu
    http://my.domain.edu
    http://my.domain.edu
    https://my.domain.edu
    https://your.page.domain.edu
    http://good.domain.edu
    

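
Both substitutions can be checked with a short script. This is a sketch using Python's `re` module rather than the Netscaler itself, so treat it as an approximation of the PCRE behaviour. `WHOLE_URL` and `KEEP_SCHEME` are the two regexes from the answer. `EMBEDDED` is an assumption that does not come from the answer: an unanchored variant for URLs that sit inside page text, where `^` and `$` would not apply. All variable names are illustrative.

```python
import re

HOME_HOST = "my.domain.edu"

# Patterns from the answer. re.MULTILINE makes ^ and $ match at every line,
# which the one-URL-per-line test input relies on.
WHOLE_URL = re.compile(r"^https?://[\w.-]+(?<!\.domain\.edu)$", re.MULTILINE)
KEEP_SCHEME = re.compile(r"^(https?://)[\w.-]+(?<!\.domain\.edu)$", re.MULTILINE)

# Assumption (not from the answer): unanchored variant for embedded URLs.
# The trailing lookahead stops the engine from backtracking to a shorter
# host name that would slip past the lookbehind.
EMBEDDED = re.compile(r"https?://[\w.-]+(?<!\.domain\.edu)(?![\w.-])")

urls = "\n".join([
    "https://www.google.com",
    "http://not.the.best.example.org",
    "http://another.bad.example.erewhon.edu",
    "https://my.domain.com",
    "https://your.page.domain.edu",
    "http://good.domain.edu",
])

# First variant: every external URL becomes the https home page URL.
fixed_scheme = WHOLE_URL.sub("https://" + HOME_HOST, urls)

# Second variant: group 1 carries the original scheme into the replacement.
same_scheme = KEEP_SCHEME.sub(r"\1" + HOME_HOST, urls)

# Hypothetical page fragment with one external and one internal link.
html = '<a href="http://www.google.com">x</a> <a href="http://good.domain.edu">y</a>'
rewritten = EMBEDDED.sub("https://" + HOME_HOST, html)

print(fixed_scheme)
print(same_scheme)
print(rewritten)
```

The unanchored variant has not been tried on a Netscaler and is deliberately simple: it replaces only the scheme and host, so any path after the host is left in place, and a URL followed directly by a sentence-ending period would need extra handling.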