javascript regex regex-lookarounds lookbehind

Match a given regex except if a given word exists (lookahead or lookbehind)


I am using a JavaScript regex to parse a series of URLs. I need to match a number in a URL (it's actually more complicated, but I'm simplifying), but only where a given word is not in the URL.

Namely, I want to exclude lines with the word 'changelogs' in them, and would therefore capture '1047', '1048', '1245' and '1049' from the following list:

http://www.opera.com/docs/changelogs/unified/1215/
http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/
http://www.blabblah/security/advisory/1047
http://booger/security/advisory/1048/
ftp://msn.global.whatever/somethingelse/1245
whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/

I know I need some kind of lookaround (lookahead or lookbehind), but I'm striking out. Here is the last pattern I've tried:

(?!changelogs)(\d+)

Here is the regex101 sandbox I'm using.

Also, it's important that the only match is the actual number. I don't want anything else to match.


Here is what my .NET code looks like (note the "BulletinOrAdvisoryPattern" is the regex in question)...

Regex bulletinPattern = new Regex(@matchingDomain.Vendor.BulletinOrAdvisoryPattern, RegexOptions.IgnoreCase);
Match bulletinMatch = bulletinPattern.Match(referenceTitle);

if (bulletinMatch.Success)
{
    // Found the bulletin ID in the NVD Reference Title
    return bulletinMatch.Value;
}

Solution

  • The "ugly" regex you need is

    (?<=http://www\.opera\.com\b(?!.*/changelogs(?:/|$))\S*)\d+
    

    See the .NET regex demo

    However, all you need is

    var result = input.Contains("/changelogs/") ? "" : input.Trim('/').Split('/').LastOrDefault();
    

    See the IDEONE C# demo:

    var lst = new List<string>() {"http://w...content-available-to-author-only...a.com/docs/changelogs/unified/1215/",
        "http://w...content-available-to-author-only...a.com/docs/changelogs/anythingelse/anything/1215/",
        "http://w...content-available-to-author-only...a.com/security/advisory/1047",
        "http://w...content-available-to-author-only...a.com/security/advisory/1048/",
        "http://w...content-available-to-author-only...a.com/doesnt/matter/could/be/anything/1049/"};
    lst.ForEach(m => Console.WriteLine(
            m.Contains("/changelogs/") ? "" : m.Trim('/').Split('/').LastOrDefault()
        ));
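
The same non-regex idea carries over to JavaScript, the language the question ended up tagged with. What follows is only a minimal sketch: `lastSegment` is a hypothetical helper name, and the sample URLs are taken from the question rather than from the IDEONE demo.

```javascript
// Hypothetical helper: returns the last path segment of a URL,
// or "" when the URL contains "/changelogs/".
function lastSegment(url) {
  if (url.includes('/changelogs/')) {
    return '';
  }
  // Strip leading and trailing slashes (a stand-in for C#'s Trim('/')),
  // then keep whatever follows the final remaining slash.
  return url.replace(/^\/+|\/+$/g, '').split('/').pop();
}

const urls = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://www.blabblah/security/advisory/1047',
  'http://booger/security/advisory/1048/',
  'whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/'
];

const segments = urls.map(lastSegment);
console.log(segments);
```

Like the C# one-liner, this does not check that the last segment is actually numeric; it simply returns whatever comes last.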
    

    UPDATE

    You switched the language from C# to JavaScript, which changes the situation drastically, since JS regex engines older than ES2018 do not support lookbehind.
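
For reference, engines that implement ES2018 do accept lookbehind, so a pattern in the spirit of the .NET one becomes possible there. The sketch below assumes such an engine (for example a current Node.js or browser). The pattern is an adaptation written for illustration, not the .NET regex above: it drops the host-specific prefix and anchors on the start of each line instead.

```javascript
// Assumption: the engine supports ES2018 lookbehind.
// The lookbehind requires that, from the start of the line up to the slash
// right before the digits, the line contains no "/changelogs" segment.
const re = /(?<=^(?!.*\/changelogs(?:\/|$)).*\/)\d+/gm;

const str = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/',
  'http://www.blabblah/security/advisory/1047',
  'http://booger/security/advisory/1048/',
  'ftp://msn.global.whatever/somethingelse/1245',
  'whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/'
].join('\n');

// With the g flag, match() returns the whole matches, which here are the bare numbers.
const ids = str.match(re) || [];
console.log(ids); // ["1047", "1048", "1245", "1049"]
```

Note that this variant matches every run of digits that directly follows a slash, not only the last one in the line, which makes no difference for the sample URLs.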

    Thus, you have to work around it: there are ways to mimic a lookbehind, or you can just use a capturing group.

    If you can use capturing, try

    /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/
    

    See the regex demo

    var re = /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/gmi;
    var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
    var res = [];
    var m; // declared explicitly so it does not leak as an implicit global

    // With the g flag, each exec() call resumes after the previous match
    while ((m = re.exec(str)) !== null) {
      res.push(m[1]); // group 1 holds only the number
    }
    document.body.innerHTML = JSON.stringify(res, null, 4);
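
Because the number sits in capture group 1 rather than in the whole match, the caller has to read `m[1]` (the JavaScript counterpart of reading a group instead of `Match.Value` in .NET). A minimal sketch of a per-URL wrapper around the same pattern; `extractId` is a hypothetical name chosen for illustration:

```javascript
// Hypothetical helper: returns the trailing number of a single URL as a string,
// or null when the URL has a "changelogs" path segment or no "/<digits>" part.
function extractId(url) {
  const m = /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/i.exec(url);
  return m ? m[1] : null;
}

console.log(extractId('http://booger/security/advisory/1048/'));
console.log(extractId('http://www.opera.com/docs/changelogs/unified/1215/'));
```

Without the `g` flag the regex keeps no state between calls, so the helper can be applied to one URL at a time.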

    Or, use an optional group (if you are replacing):

    var re = /(\/changelogs\/.*)?\/(\d+)/gi; 
    var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
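    var result = str.replace(re, function (m, g1, g2){
      // g1 is set only when "/changelogs/" precedes the number: leave those matches alone.
      // Otherwise re-emit the slash, which is part of the match, so only the digits change.
      return g1 ? m : "/NEW_VAL";
    });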
    document.body.innerHTML = result;
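
The optional-group trick can also be wrapped in a function that runs outside a browser. This is a sketch only: `replaceIds` and its parameters are hypothetical names, and it re-emits the leading slash (which belongs to the match) so that only the digits are swapped.

```javascript
// Hypothetical helper: replaces the number after a slash with `replacement`,
// except on lines where "/changelogs/" appears before that number.
function replaceIds(text, replacement) {
  const re = /(\/changelogs\/.*)?\/(\d+)/gi;
  return text.replace(re, function (match, changelogPart) {
    // changelogPart is undefined unless the optional group took part in the match.
    return changelogPart ? match : '/' + replacement;
  });
}

const sample =
  'http://www.opera.com/docs/changelogs/unified/1215/\n' +
  'http://booger/security/advisory/1048/';

const replaced = replaceIds(sample, 'NEW_VAL');
console.log(replaced);
```

Since `.` does not match a newline, the optional group never reaches past the end of its own line, so each URL is handled independently.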