Tags: c#, .net, web-scraping, css-selectors, html-agility-pack

How to scrape multiple selectors and group them


I want to scrape this page (for my own education): https://www.g2crowd.com/products/google-analytics/reviews

    // @nuget: HtmlAgilityPack
    using System;
    using HtmlAgilityPack;

    public class Program
    {
        public static void Main()
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
            var textNodes = html.DocumentNode.SelectNodes("//h3[contains(@class,'review-list-heading')]");
            if (textNodes != null)
                foreach (var t in textNodes)
                    Console.WriteLine(t.InnerText);
        }
    }

This is what I have so far, and it pulls every review heading correctly. But how would I scrape both the heading and the review body, while making it clear that each review is separate?

The review "body" (meaning the text) is: //*[@id="pjax-container"]/div[2]/div[2]/div[6]/div[3]/div/div/div[2]/div[2]/div/div in XPath.

Or <div itemprop="reviewBody"> in pure html.

This is a dotnetfiddle of what I have currently: https://dotnetfiddle.net/30Y0M6

Please ask if I'm not being clear enough.


Solution

  • Select each parent container (<div class="mb-2 border-bottom">) first, then select the child nodes you need inside it, so that each review's heading and text stay grouped together.

    // @nuget: HtmlAgilityPack
    using System;
    using HtmlAgilityPack;
    
    public class Program
    {
        public static void Main()
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
            var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
            if (divNodes != null)
            {
                foreach (var child in divNodes)
                {
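                    // SelectNodes returns null when nothing matches, so guard before iterating.
                    var allowedTags = child.SelectNodes(".//h3 | .//h5 | .//p");
                    if (allowedTags != null)
                        foreach (var tag in allowedTags)
                            Console.WriteLine(tag.InnerText);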
                    Console.WriteLine("======================================");
                }
            }
        }
    }
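
If you only want the heading and the review body from the question (rather than every h3, h5 and p), the same container-first idea can be narrowed down. The following is a minimal sketch, not a tested scraper for the live site: the `SampleHtml` markup, the `ReviewScraper` class and the `ExtractReviews` helper are hypothetical names made up for illustration, and it assumes each review container holds one `review-list-heading` h3 and one `itemprop="reviewBody"` div, as described in the question. It parses an inline string so it can be run without hitting the site.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ReviewScraper
{
    // Hypothetical markup that mimics the structure described in the question;
    // the real page may differ.
    public const string SampleHtml = @"
<div class='mb-2 border-bottom'>
  <h3 class='review-list-heading'>Great tool</h3>
  <div itemprop='reviewBody'><p>Easy to set up.</p></div>
</div>
<div class='mb-2 border-bottom'>
  <h3 class='review-list-heading'>Steep learning curve</h3>
  <div itemprop='reviewBody'><p>Powerful but complex.</p></div>
</div>";

    // Returns one (heading, body) pair per review container.
    public static List<KeyValuePair<string, string>> ExtractReviews(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var reviews = new List<KeyValuePair<string, string>>();
        var containers = doc.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        if (containers == null)
            return reviews; // SelectNodes returns null when nothing matches

        foreach (var container in containers)
        {
            // The leading "." keeps each query relative to the current container.
            var heading = container.SelectSingleNode(".//h3[contains(@class,'review-list-heading')]");
            var body = container.SelectSingleNode(".//div[@itemprop='reviewBody']");
            reviews.Add(new KeyValuePair<string, string>(Clean(heading), Clean(body)));
        }
        return reviews;
    }

    // Decodes HTML entities and trims whitespace; a missing node becomes an empty string.
    private static string Clean(HtmlNode node)
    {
        return node == null ? string.Empty : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}

public class Program
{
    public static void Main()
    {
        foreach (var review in ReviewScraper.ExtractReviews(ReviewScraper.SampleHtml))
        {
            Console.WriteLine("Heading: " + review.Key);
            Console.WriteLine("Body:    " + review.Value);
            Console.WriteLine("======================================");
        }
    }
}
```

To try it against the live page, you could pass `new HtmlWeb().Load(url).DocumentNode.OuterHtml` to `ExtractReviews`. Keeping the pairs in a list instead of printing them straight away makes it easy to check which body belongs to which heading.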