c# .net html html-parsing html-agility-pack

How to get img/src or a/hrefs using Html Agility Pack?


I want to use the Html Agility Pack to extract image (img/src) and link (a/href) URLs from an HTML page, but I don't know much about XML or XPath. I have looked through the help documentation on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. I am not fluent in English, so I would sincerely appreciate it if someone could post some sample code.


Solution

  • The first example on the home page does something very similar, but consider:

    // requires: using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.Load("file.htm"); // use doc.LoadHtml(htmlSource) if the HTML is in a string, not a file
    // Make SelectNodes return an empty collection instead of null when nothing matches,
    // so the foreach below does not throw a NullReferenceException.
    doc.OptionEmptyCollection = true;
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        string href = link.Attributes["href"].Value;
        // store href somewhere
    }
    

    For img/@src, replace each a with img and each href with src. You might even be able to handle both in a single pass:

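    List<string> list = new List<string>(); // requires: using System.Collections.Generic;
    // The predicates select the elements themselves, so Attributes[...] is well defined.
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]"))
    {
        HtmlAttribute href = node.Attributes["href"];
        HtmlAttribute src = node.Attributes["src"];
        // Every match is an a with href or an img with src, so at least one is non-null.
        list.Add((href ?? src).Value);
    }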
    

    For relative URL handling, look at the System.Uri class.
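
As a rough illustration of the relative URL handling mentioned above, here is a minimal sketch that resolves an extracted href or src value against the address of the page it came from. It uses only System.Uri (no Html Agility Pack calls). The class name RelativeUrlDemo, the helper ToAbsolute, and the example.com addresses are made up for this example; they are not part of any library.

```csharp
using System;

public static class RelativeUrlDemo
{
    // Hypothetical helper: resolves a link found in a page against that page's URL.
    // Returns null when the value cannot be combined into a valid URI.
    public static string ToAbsolute(Uri pageUri, string link)
    {
        Uri result;
        return Uri.TryCreate(pageUri, link, out result) ? result.AbsoluteUri : null;
    }

    public static void Main()
    {
        // Assumed page address, for illustration only.
        Uri pageUri = new Uri("http://example.com/articles/index.html");

        // Path relative to the page's folder.
        Console.WriteLine(ToAbsolute(pageUri, "images/logo.png"));
        // prints http://example.com/articles/images/logo.png

        // Path relative to the site root.
        Console.WriteLine(ToAbsolute(pageUri, "/about.html"));
        // prints http://example.com/about.html

        // Already absolute: returned unchanged.
        Console.WriteLine(ToAbsolute(pageUri, "http://other.example.org/a.html"));
        // prints http://other.example.org/a.html
    }
}
```

In the loops above, you would pass each href or src value through a helper like this before storing it, using the URL the page was downloaded from as pageUri.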