
C# screen scraping project - WebBrowser not changing URL


I'm working on a small automation project at the moment and have hit a brick wall. First, I'd like to state that the only reason I'm using WebBrowser for this component of the project is that the site being scraped has obfuscated code and requires a Java-enabled browser to display it. I have another app using WebClient that works fine for other test sites, but unfortunately it can't be used on this target.

My problem arises when trying to programmatically configure the WebBrowser control.

The first problem I've discovered is that if I manually set the URL in the control's properties, it loads page 1 and the scraper works for that page.
However, when I clear the URL in the properties and set it in the Form1_Load method instead, the control reports about:blank as its URL, even though I've verified that the value being pulled in is correct and should be set without issue.

Here's what I'm using:

Note:
collection refers to an XML-serialized array of definitions
definition refers to the active definition for this target, the idea being to configure this for multiple targets

    private void Form1_Load(object sender, EventArgs e)
    {
        PopulateScraperCollection();
        webBrowser1.Url = new Uri(collection.ElementAt(b).AccessUrl);
        NavigateToUrl(collection.ElementAt(b).AccessUrl);
    }

    public void PopulateScraperCollection()
    {
        string[] xmlFiles = Directory.GetFiles(@"E:\DealerConfigs\");
        foreach (string xmlFile in xmlFiles)
        {
            collection.Add(ScraperDefinition.Deserialize(xmlFile));
        }
    }

    public void NavigateToUrl(string url)
    {
        Console.WriteLine(collection.ElementAt(b).AccessUrl);
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        webBrowser1.Navigate(webBrowser1.Url);
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = sender as WebBrowser;
        Process(collection.ElementAt(b), 0);
        b++;
    }

This in turn causes another issue when using DocumentCompleted to navigate to the paginated results. On the first page load I use a DocumentCompleted event to trigger the link extraction.
When I attempt to set the URL for the next page (which is being picked out correctly using XPath, and again verified), stepping over with F10 in the debugger shows that the URL hasn't changed, and the DocumentCompleted event isn't triggered.

My code to change the URL is:

    string nextPageUrl = string.Format(definition.NextPageUrlFormat, WebUtility.HtmlDecode(relativeUrl));
    webBrowser1.Url = new Uri(nextPageUrl);
    webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
    webBrowser1.Navigate(webBrowser1.Url);

Any help, as always, is greatly appreciated. This is proving to be a nightmare to automate, not only because WebBrowser is so much slower than WebClient, but also because it's proving a pain to alter on the fly.

Regards

Barry


Solution

  • You should never really set webBrowser1.Url directly. You should just use the Navigate method, so:

    private void Form1_Load(object sender, EventArgs e)
    {
        PopulateScraperCollection();
        NavigateToUrl(collection.ElementAt(b).AccessUrl);
    }
    

    My guess as to why it isn't navigating is that collection.ElementAt(b).AccessUrl is null or about:blank.

    I'm not really sure how to answer the rest of your question, but calling the Navigate method should change the URL.

    NB: The WebBrowser control is proper crap; you could try an alternative browser control such as Awesomium or GeckoFX.
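
To make the Navigate suggestion concrete, here is a minimal sketch of how the paging loop could be wired up. As far as I know, WebBrowser.Url reports the document that is currently loaded, so reading it straight after assigning it can still give about:blank, which would explain why Navigate(webBrowser1.Url) goes nowhere. The sketch passes the target URL straight to Navigate, checks it first, and subscribes to DocumentCompleted only once. ScraperForm, pendingUrls, findNextPageUrl and the example.com address are hypothetical names made up for illustration; they stand in for your collection and XPath extraction and are not part of your project.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

// Hypothetical minimal form: ScraperForm, pendingUrls and findNextPageUrl are
// illustrative names, not part of the original project.
public class ScraperForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };
    private readonly Queue<string> pendingUrls = new Queue<string>();
    private readonly Func<HtmlDocument, string> findNextPageUrl;

    public ScraperForm(IEnumerable<string> startUrls, Func<HtmlDocument, string> findNextPageUrl)
    {
        this.findNextPageUrl = findNextPageUrl;
        foreach (string url in startUrls) pendingUrls.Enqueue(url);

        Controls.Add(browser);
        // Subscribe once. Adding the handler before every Navigate call would
        // make it run several times per page.
        browser.DocumentCompleted += Browser_DocumentCompleted;
        Load += (sender, e) => NavigateNext();
    }

    private void NavigateNext()
    {
        while (pendingUrls.Count > 0)
        {
            string candidate = pendingUrls.Dequeue();
            Uri target;
            // Rejects null, empty, relative and about:blank values up front.
            bool valid = Uri.TryCreate(candidate, UriKind.Absolute, out target)
                && (target.Scheme == Uri.UriSchemeHttp || target.Scheme == Uri.UriSchemeHttps);
            if (valid)
            {
                // Pass the target itself rather than reading browser.Url back.
                browser.Navigate(target);
                return;
            }
            Console.WriteLine("Skipping invalid URL: " + candidate);
        }
    }

    private void Browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can also fire for frames; this common heuristic keeps only
        // the top-level document.
        if (e.Url != browser.Url) return;

        // Scrape browser.Document here, then ask the callback for a following page.
        string next = findNextPageUrl(browser.Document);
        if (next != null) pendingUrls.Enqueue(next);
        NavigateNext();
    }

    [STAThread]
    private static void Main()
    {
        // Placeholder URL and a callback that never finds a next page.
        Application.Run(new ScraperForm(new[] { "https://example.com/" }, document => null));
    }
}
```

Because the next page is added to the end of the queue, any remaining start URLs are visited before it; if you want each target's pages scraped back to back, navigate to the next page directly instead of enqueuing it. The "Skipping invalid URL" message should also tell you quickly whether AccessUrl really is null or about:blank.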