Regex for URL C#


In my C# program I wrote a Google Search Function, which works by fetching the source from each page and getting the URLs via regex.

My current regex is:

(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)

This works well at the moment, but it also returns URLs that include a query string, for example http://www.example.com/forums/arcade.php?efdf=332

In this case I just want the URL without the ?efdf=332 at the end.

So how should I change the regex?


Solution

  • http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+
    

    does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.

    In C#:

    Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
    

    That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)
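
As a rough sketch of how the pattern above might be applied to fetched page source: the class name `UrlExtractor` and the method `ExtractUrls` are made up for illustration and are not from the question. It assumes the page source is already available as a string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class UrlExtractor
{
    // Pattern from the answer above. '?' is not in the character class,
    // so each match ends just before any query string.
    private static readonly Regex UrlRegex = new Regex(
        @"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Returns every http:// URL found in pageSource, cut off before a '?'.
    public static List<string> ExtractUrls(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match match in UrlRegex.Matches(pageSource))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

Calling `UrlExtractor.ExtractUrls` on source containing http://www.example.com/forums/arcade.php?efdf=332 should give a single entry, http://www.example.com/forums/arcade.php. Note that the pattern still only recognises http:// links, and since `#` is in the character class, a fragment that comes before any `?` is kept; replacing `http://` with `https?://` would be the usual way to accept https as well.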