c#, web-scraping, web, post

Issues trying to web scrape in C#


I am trying to scrape some entries from a table on a website, but when I make the GET request, the response string does not include the entries that are shown on the website.

Here is the website: https://www.services-rte.com/en/view-data-published-by-rte/downtime-of-generation-resources.html

My guess is that I need to make a POST request to load the table, but I can't work out exactly what to post. Correct me if I am wrong.

Here is my code:

    static async void GetEntries()
    {
        var services = new ServiceCollection();
        services.AddHttpClient();
        var serviceProvider = services.BuildServiceProvider();
        var httpClientFactory = serviceProvider.GetService<IHttpClientFactory>();
        var client = httpClientFactory.CreateClient();

        string response = string.Empty;
        try
        {
            response = await client.GetStringAsync("https://www.services-rte.com/en/view-data-published-by-rte/downtime-of-generation-resources.html");
        }
        catch
        {
            Console.WriteLine("Site not found.");
            return;
        }

        var parser = new HtmlParser();
        var document = parser.ParseDocument(response);

        string content = string.Empty;
        for (int i = 1; i <= 20; i++)
        {
            try
            {
                Console.WriteLine(i);
                content = document.QuerySelector($"#wrapper > div > div > div.c-editorial-page__container > div.c-editorial-page__content > ctx-remit-generation-unavailability > cortex-remit-generation-unavailability-table > cortex-table > div > div.ctx__table_content > cortex-table-row:nth-child({i})").TextContent;
            }
            catch
            {
                Console.WriteLine($"CSS selector not found for {i}.");
                continue;
            }

            Console.WriteLine(content);
            Console.WriteLine("NEW");
        }
    }

The error is thrown on this line:

    content = document.QuerySelector($"#wrapper > div > div > div.c-editorial-page__container > div.c-editorial-page__content > ctx-remit-generation-unavailability > cortex-remit-generation-unavailability-table > cortex-table > div > div.ctx__table_content > cortex-table-row:nth-child({i})").TextContent;

NullReferenceException: Object reference not set to an instance of an object.
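
QuerySelector returns null when nothing matches, which is why reading .TextContent on its result throws here. A quick way to narrow this down is to check whether the row markup is present in the raw response at all, before running any selector against it. The following is only a minimal sketch: the class and method names (ScrapeDiagnostics, ContainsElement) are hypothetical, and the check is a plain substring search rather than a real HTML parse.

```csharp
using System;

public static class ScrapeDiagnostics
{
    // Hypothetical helper: returns true when the raw HTML contains an opening
    // tag that starts with the given element name, e.g. "cortex-table-row".
    // This is a crude substring check, so a longer tag name sharing the same
    // prefix would also match.
    public static bool ContainsElement(string html, string tagName)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrEmpty(tagName))
        {
            return false;
        }

        return html.IndexOf("<" + tagName, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling ScrapeDiagnostics.ContainsElement(response, "cortex-table-row") right after GetStringAsync shows whether the rows are in the downloaded HTML. If it returns false, no CSS selector will find them, however it is written.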


Solution

  • I think the table data is loaded asynchronously, i.e. filled in by JavaScript after the initial page has loaded. I had this problem once: I could see the HTML in the browser, but when I made the request from C# it was not in the response.

    What you can do is use something like Selenium, which drives a real browser and therefore works with websites that load their data asynchronously. This might not be the best answer because I can't really show you how to use it, but Selenium has C# bindings (the Selenium.WebDriver NuGet package).

    Maybe this website can help you: https://www.scrapingdog.com/blog/web-scraping-with-csharp/ (not mine, but it looks promising).
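
A rough sketch of the Selenium approach follows, in case it helps as a starting point. It is untested against this site and rests on several assumptions: the Selenium.WebDriver NuGet package is installed (older versions also need Selenium.Support for WebDriverWait), Chrome and a matching driver are available, and the rows are cortex-table-row elements in the regular DOM (the name is taken from the selector in the question) rather than inside a shadow root. The class name RteScraper and the 30-second timeout are arbitrary choices.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class RteScraper
{
    private const string Url =
        "https://www.services-rte.com/en/view-data-published-by-rte/downtime-of-generation-resources.html";

    // Assumption: each table row is a <cortex-table-row> element that
    // ordinary CSS selectors can reach (i.e. not hidden in a shadow root).
    private const string RowSelector = "cortex-table-row";

    public static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run without a visible browser window

        // Disposing the driver closes the browser when the method exits.
        using var driver = new ChromeDriver(options);
        driver.Navigate().GoToUrl(Url);

        // Poll until the page's scripts have rendered at least one row,
        // giving up after 30 seconds.
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
        try
        {
            wait.Until(d => d.FindElements(By.CssSelector(RowSelector)).Count > 0);
        }
        catch (WebDriverTimeoutException)
        {
            Console.WriteLine("No table rows appeared before the timeout.");
            return;
        }

        // Print the visible text of every rendered row.
        foreach (IWebElement row in driver.FindElements(By.CssSelector(RowSelector)))
        {
            Console.WriteLine(row.Text);
        }
    }
}
```

The explicit wait is the important part: a browser-driven request returns as soon as the page shell has loaded, so reading the rows immediately would run into the same empty table as the HttpClient version.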