c#, html-agility-pack

How to make asynchronous calls using HtmlAgilityPack?


I'm trying to get the table with id table-matches available here. The problem is that the table is loaded via ajax, so I don't get the full HTML when I download the page:

string url = "http://www.oddsportal.com/matches/soccer/20180701/";

using (HttpClient client = new HttpClient())
{
    using (HttpResponseMessage response = client.GetAsync(url).Result)
    {
        using (HttpContent content = response.Content)
        {
            string result = content.ReadAsStringAsync().Result;
        }
    }
}

The HTML returned does not contain any table. To check whether the library was the problem, I turned JavaScript off in Chrome (via the Dev console, F12) and got the same result in the browser.

To fix this problem I thought of using a WebBrowser control, in particular:

webBrowser.Navigate("oddsportal.com/matches/soccer/20140221/"); 
HtmlElementCollection elements = webBrowser.Document.GetElementsByTagName("table");

but I would like to know whether I can load the full HTML using asynchronous calls as well. Has anyone encountered a similar problem?

Could you please share a solution? Thanks.


Solution

  • The main issue with this page is that the content inside table-matches is loaded via ajax, and neither HttpClient nor HtmlAgilityPack is able to wait for the ajax calls to finish. Therefore, you need a different approach.

    Approach #1 - Use any headless browser like PuppeteerSharp

    using PuppeteerSharp;
    using System;
    using System.Threading.Tasks;
    
    namespace PuppeteerSharpDemo
    {
        class Program
        {
            private static String url = "http://www.oddsportal.com/matches/soccer/20180701/";
    
            static void Main(string[] args)
            {
                var htmlAsTask = LoadAndWaitForSelector(url, "#table-matches .table-main");
                htmlAsTask.Wait();
                Console.WriteLine(htmlAsTask.Result);
    
                Console.ReadKey();
            }
    
            public static async Task<string> LoadAndWaitForSelector(String url, String selector)
            {
                // Launch headless Chrome; ExecutablePath points at a locally installed browser,
                // so no separate Chromium download is needed.
                using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions
                {
                    Headless = true,
                    ExecutablePath = @"c:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
                }))
                using (Page page = await browser.NewPageAsync())
                {
                    await page.GoToAsync(url);
                    // Wait until the ajax-rendered table is present before reading the markup.
                    await page.WaitForSelectorAsync(selector);
                    return await page.GetContentAsync();
                }
            }
        }
    }
    

    For the sake of cleanness, I've posted the output here. And once you get the HTML content, you are able to parse it with HtmlAgilityPack.
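
    For example, a minimal parsing sketch (the XPath is an assumption about the page markup, and html stands for the string returned by LoadAndWaitForSelector above):

    using System;
    using HtmlAgilityPack;

    // Load the headless-browser output into HtmlAgilityPack.
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // Select every row inside the ajax-loaded table; SelectNodes returns null when nothing matches.
    var rows = doc.DocumentNode.SelectNodes("//*[@id='table-matches']//tr");
    if (rows != null)
    {
        foreach (var row in rows)
        {
            Console.WriteLine(row.InnerText.Trim());
        }
    }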

    Approach #2 - Use pure Selenium WebDriver. It can be launched in headless mode (see the note after the code below).

    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Support.UI;
    using System;
    
    namespace SeleniumDemo
    {
        class Program
        {
            private static IWebDriver webDriver;
            private static TimeSpan defaultWait = TimeSpan.FromSeconds(10);
            private static String targetUrl = "http://www.oddsportal.com/matches/soccer/20180701/";
            private static String driversDir = @"../../Drivers/";
    
            static void Main(string[] args)
            {
                // chromedriver.exe is expected to be in driversDir.
                webDriver = new ChromeDriver(driversDir);
                webDriver.Navigate().GoToUrl(targetUrl);

                IWebElement table = webDriver.FindElement(By.Id("table-matches"));
                var innerHtml = table.GetAttribute("innerHTML");
                Console.WriteLine(innerHtml);

                // Close the browser and the chromedriver process when done.
                webDriver.Quit();
            }
    
            #region (!) I didn't even use this, but it can be useful (!)
            public static IWebElement FindElement(By by)
            {
                try
                {
                    WaitForAjax();
                    var wait = new WebDriverWait(webDriver, defaultWait);
                    return wait.Until(driver => driver.FindElement(by));
                }
                catch
                {
                    return null;
                }
            }
    
            public static void WaitForAjax()
            {
                var wait = new WebDriverWait(webDriver, defaultWait);
                wait.Until(d => (bool)(d as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0"));
            }
            #endregion
        }
    }
    
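    The snippet above launches a visible Chrome window. If you want the headless mode mentioned earlier, ChromeOptions can be used; this is a sketch, not part of the original code, and the rest of the program stays the same:

    // Pass --headless to Chrome so no browser window is shown.
    var options = new ChromeOptions();
    options.AddArgument("--headless");
    webDriver = new ChromeDriver(driversDir, options);
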

    Approach #3 - Simulate ajax requests

    If you analyse the page loading using Fiddler or the browser's profiler (F12), you can see that all the data comes from these two requests:

    [Fiddler capture: the two ajax requests that carry the data]

    So you can try to execute them directly using HttpClient. But in this case you may need to pass the authorization headers (and possibly other headers) with each HTTP request.
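
    A rough sketch of such a direct call (everything below is a placeholder: the real ajax URL and the exact headers have to be taken from the Fiddler/profiler capture):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class AjaxRequestDemo
    {
        static async Task Main()
        {
            // Placeholder endpoint - substitute the real ajax URL captured above.
            var ajaxUrl = "http://www.oddsportal.com/ajax-placeholder";

            using (var client = new HttpClient())
            {
                var request = new HttpRequestMessage(HttpMethod.Get, ajaxUrl);

                // Typical headers for this kind of request; the exact set is site-specific.
                request.Headers.Referrer = new Uri("http://www.oddsportal.com/matches/soccer/20180701/");
                request.Headers.Add("X-Requested-With", "XMLHttpRequest");

                using (var response = await client.SendAsync(request))
                {
                    string body = await response.Content.ReadAsStringAsync();
                    Console.WriteLine(body);
                }
            }
        }
    }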