c# screen-scraping

C# Screen Scraper - Handle long URIs


I'm building an HTML screen scraper that parses URLs and then compares them with a set of other URLs.

The comparison is done with Uri.AbsoluteUri or Uri.Host.

My problem is that when I create a new Uri (new Uri(url)), a UriFormatException is thrown when the URL is too long or contains too many slashes.

Since my predefined set of URLs contains several (too) long URLs, I cannot just use Substring to fetch only part of the URL.

What would be the best way to handle this?

Thanks


Solution

  • You can use Uri.TryCreate to check whether the URI is valid instead of calling the constructor directly; it returns false rather than throwing an exception.

    You should not get an exception on a URL that is this short. The following program runs fine in VS2008:

    static void Main(string[] args)
    {
        Uri uri = new Uri("http://stackoverflow.com/questions/1298985/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/c-screen-scraper-handle-long-uris/");
        Uri uri2 = new Uri("http://stackoverflow.com/questions/1298985/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/");
        Console.ReadLine();
    }
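
As a rough illustration of the Uri.TryCreate suggestion, here is a minimal sketch of how scraped strings might be parsed without exceptions and then compared by Uri.Host. The names UrlMatcher, TryParseHttpUrl, and HostIsKnown are hypothetical helpers made up for this example, and restricting the check to http/https is an assumption about what the scraper cares about, not something stated in the question.

```csharp
using System;
using System.Collections.Generic;

static class UrlMatcher
{
    // Hypothetical helper: returns true when the input parses as an absolute
    // http or https URL. Returns false instead of throwing UriFormatException.
    public static bool TryParseHttpUrl(string url, out Uri uri)
    {
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Hypothetical helper: true when the URL parses and its host is in knownHosts.
    public static bool HostIsKnown(string url, ICollection<string> knownHosts)
    {
        Uri uri;
        return TryParseHttpUrl(url, out uri) && knownHosts.Contains(uri.Host);
    }
}

static class Program
{
    static void Main()
    {
        // Assumed predefined set; a real scraper would load its own list.
        var knownHosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "stackoverflow.com"
        };

        string[] scraped =
        {
            "http://stackoverflow.com/questions/1298985/c-screen-scraper-handle-long-uris",
            "not a url",
        };

        foreach (string url in scraped)
        {
            // Invalid input is simply reported as False; no try/catch is needed.
            Console.WriteLine("{0} -> {1}", url, UrlMatcher.HostIsKnown(url, knownHosts));
        }
    }
}
```

With this approach, strings that fail to parse can be logged and skipped, so one bad link on a page does not stop the whole scrape. Comparing on AbsoluteUri instead of Host would follow the same pattern once TryParseHttpUrl has succeeded.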