Tags: c#, web-scraping, html-agility-pack

How to use HtmlAgilityPack to get specific data from stock website


I want to extract a number from this page: https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml

The number I want is the one highlighted in yellow in the image below:

[Image: output]

To extract it, I wrote this code in C#:

    var html = @"https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml";
    HtmlWeb web = new HtmlWeb();
    web.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
    var htmlDoc = web.Load(html);
    var node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='Listed_IncomeStatement_tableResult']/tbody/tr[1]/td[2]");
    string strSo = node.OuterHtml;

    Console.WriteLine(strSo);

However, strSo does not contain the highlighted number (19,749,872). Could you show me how to extract that number from the website? Sorry, my English is not very good.


Solution

  • The problem is that the website loads the data into the table via an AJAX request after the page has loaded, whereas HtmlAgilityPack can only see what the server sends directly in the initial response.

    You can verify this by looking at the source that HtmlWeb downloads: in the DocumentNode HTML, the table tag with id "Listed_IncomeStatement_tableResult" has no data in its tbody.
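
    The check described above can be sketched as follows. This is only an illustration: the HTML string is hypothetical markup standing in for what the server sends before any script runs, and CountRows is a helper name invented for this example. It relies on the fact that HtmlAgilityPack's SelectNodes returns null, rather than an empty collection, when nothing matches.

    ```csharp
    using System;
    using HtmlAgilityPack;

    class EmptyTableCheck
    {
        // Hypothetical helper: counts the <tr> rows under the <tbody> of the table with the given id.
        public static int CountRows(string html, string tableId)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // SelectNodes returns null (not an empty collection) when no node matches.
            var rows = doc.DocumentNode.SelectNodes($"//*[@id='{tableId}']/tbody/tr");
            return rows?.Count ?? 0;
        }

        static void Main()
        {
            // Assumed markup: an empty table body, as it would look before the AJAX call fills it.
            string staticHtml =
                "<table id='Listed_IncomeStatement_tableResult'><tbody></tbody></table>";

            // Prints 0, because the static markup contains no rows.
            Console.WriteLine(CountRows(staticHtml, "Listed_IncomeStatement_tableResult"));
        }
    }
    ```

    If the same check against htmlDoc from the question also reports no rows, the data is not in the initial response and has to be obtained another way.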

    To get around this, you can use Selenium WebDriver.

    It drives a real browser (Firefox or Chrome, for example), which executes the page together with all of its JavaScript and then gives you back the complete source of the page after the scripts have run.

    You can download the driver for Chrome here: Chrome Driver

    After you have imported the libraries, you only need to run the following code:

    // Make sure to pass the path of the folder where you extracted chromedriver.exe:
    IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver");
    driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");
    

    After that, you can access the elements of the web page directly from the driver object, for example:

    IWebElement myField = driver.FindElement(By.Id("tools"));
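
    Because the table is filled by an AJAX request, the cell may not exist yet right after GoToUrl returns. A minimal sketch of waiting for it, assuming the Selenium.Support package (for WebDriverWait) is installed and that the XPath from the question still matches the live page; the 10-second timeout and the ParseThousands helper are choices made for this example, not part of the original answer:

    ```csharp
    using System;
    using System.Globalization;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Support.UI;

    class WaitForTableCell
    {
        // Hypothetical helper: converts text such as "19,749,872" to a number,
        // assuming commas are thousands separators and there is no sign or bracket.
        public static long ParseThousands(string text) =>
            long.Parse(text.Trim(), NumberStyles.AllowThousands, CultureInfo.InvariantCulture);

        static void Main()
        {
            IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver");
            try
            {
                driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");

                // XPath taken from the question; whether it selects the wanted cell is an assumption.
                By cell = By.XPath("//*[@id='Listed_IncomeStatement_tableResult']/tbody/tr[1]/td[2]");

                // Poll until the cell exists and has text. WebDriverWait ignores
                // "element not found" errors while polling and throws on timeout.
                var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
                IWebElement element = wait.Until(d =>
                {
                    IWebElement e = d.FindElement(cell);
                    return string.IsNullOrWhiteSpace(e.Text) ? null : e;
                });

                Console.WriteLine(ParseThousands(element.Text));
            }
            finally
            {
                driver.Quit(); // always close the browser and the driver process
            }
        }
    }
    ```

    Reading element.Text gives only the visible text of the cell, which is usually easier to parse than OuterHtml.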
    

    The only drawback of ChromeDriver is that it opens a browser window to render everything. To avoid this, you can run Chrome in headless mode, or use a headless driver such as PhantomJS, which does the same job as Chrome without opening a window (note that PhantomJS is no longer actively maintained).
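
    One way to avoid the visible window without switching drivers is to pass Chrome's headless flag through ChromeOptions. The following is a sketch under that assumption; BuildHeadlessOptions is a name invented for this example, and the window size is an arbitrary value:

    ```csharp
    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class HeadlessExample
    {
        // Hypothetical helper: builds options that ask Chrome to run without a visible window.
        public static ChromeOptions BuildHeadlessOptions()
        {
            var options = new ChromeOptions();
            options.AddArgument("--headless");
            // Arbitrary viewport size, so that layout-dependent pages render predictably.
            options.AddArgument("--window-size=1920,1080");
            return options;
        }

        static void Main()
        {
            // Same driver folder placeholder as in the answer above.
            IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver", BuildHeadlessOptions());
            try
            {
                driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");
                Console.WriteLine(driver.Title);
            }
            finally
            {
                driver.Quit(); // close the headless browser and the driver process
            }
        }
    }
    ```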

    For more examples of how to use Selenium WebDriver with C#, I recommend taking a look at:

    Selenium C# tutorial