c# html html-table screen-scraping html-agility-pack

How can I scrape data from a dynamic table using HtmlAgilityPack and C#?


I have been trying numerous methods over the last few days to extract data from a table:

The Link to the website.

This is one version of code I found online and adapted. I have tried many approaches and am unsure whether the XPath is correct or where the issue is occurring:

        private void button26_Click(object sender, EventArgs e)
        {
            //BCFERRIES 2

            // URL of the website containing the table
            string url = "https://www.bcferries.com/current-conditions/SWB-TSA/";

            // Load the HTML content from the URL
            HtmlWeb web = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load(url);

            //string tableXPath = "//table[@class='table-class']";
            //string tableXPath = "//*[@id=\"tabs-1\"]/div[1]/table";
            //string tableXPath ="/html/body/main/section[6]/div[1]/div/div[5]/div[1]/div[1]/table";
            //string tableXPath = "//*[@id=\"tabs-1\"]";
            //*[@id="tabs-1"]/div[1]/table/tbody
            //string tableXPath = "//div[@id='tabs-1']/div[1]/table";
            string tableXPath = "//div[@id='tabs']";

            // Get the table from the HTML document
            HtmlNode tableNode = doc.DocumentNode.SelectSingleNode(tableXPath);

            //TEST
            //HtmlNode firstChild = tableNode.FirstChild;
            //HtmlNode firstChild = tableNode.LastChild;
            //HtmlNode firstChild = tableNode.NextSibling;
            //MessageBox.Show(firstChild.OuterHtml);
            //MessageBox.Show(firstChild.InnerHtml);


            // Check if the table exists
            if (tableNode != null)
            {
                // Get all rows in the table
                //var rows = tableNode.SelectNodes(".//tr");
                var rows = tableNode.SelectNodes("./tr");

                // Iterate through each row and display the data
                foreach (var row in rows)
                {
                    //var cells = row.SelectNodes(".//td");
                    var cells = row.SelectNodes("./td");

                    if (cells != null)
                    {
                        foreach (var cell in cells)
                        {
                            richTextBox1.AppendText(cell.InnerText.Trim() + "\t");
                            //MessageBox.Show(cell.InnerText.Trim());
                        }
                        richTextBox1.AppendText("\n");
                        //MessageBox.Show("");
                    }
                }

            }
            else
            {
                MessageBox.Show("Table not found on the website.");
            }
        }

Each time I run the code, one of two things happens, depending on the XPath I use (I have left many of my attempts in the comments): either it can't find the table, or it finds the node but displays a blank message box when I try to view the first child node, and then the program fails when trying to read the first row.

Any help would be appreciated. I am trying to see whether I can read any of the time, boat, or status fields before I build out the array or list for storing the data.

Thanks, Doug


Solution

  • The response the site returns to a browser and the response it returns to the code are different. I tried removing the trailing slash from string url = "https://www.bcferries.com/current-conditions/SWB-TSA/"; and received a result containing the table.
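The suggestion above might be sketched roughly as follows. This is only an illustration, not tested against the live site: `FerryTableScraper`, `ExtractRows`, and `ScrapeUrl` are hypothetical names, and the XPath `//table//tr` is an assumption about the page layout (it simply takes every row of every table). The sketch also guards against `SelectNodes` returning null, which it does when nothing matches. That is probably why the question's code fails at the first row: `./tr` only matches direct children of the selected `div`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FerryTableScraper
{
    // Hypothetical helper: returns the trimmed cell text of every table row in the HTML.
    public static List<List<string>> ExtractRows(string html)
    {
        // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<List<string>>();

        // Descendant search, so rows are found with or without a <tbody> wrapper.
        // SelectNodes returns null (not an empty list) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
        {
            return result;
        }

        foreach (var row in rows)
        {
            // Header and data cells that are direct children of this row.
            var cells = row.SelectNodes("./th|./td");
            if (cells == null)
            {
                continue;
            }

            result.Add(cells
                .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
                .ToList());
        }

        return result;
    }

    // Hypothetical helper: loads the page without a trailing slash, as the answer suggests.
    public static List<List<string>> ScrapeUrl(string url)
    {
        var web = new HtmlWeb();
        var doc = web.Load(url.TrimEnd('/'));
        return ExtractRows(doc.DocumentNode.OuterHtml);
    }
}
```

In the button handler from the question, the result could then be written out with something like `foreach (var row in FerryTableScraper.ScrapeUrl(url)) richTextBox1.AppendText(string.Join("\t", row) + "\n");`. If the page contains several tables, the XPath would need narrowing to the one of interest.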