I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. My English is not fluent, so I would sincerely thank anyone who can write some helpful code.
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
doc.OptionEmptyCollection = true; // avoid null reference exception
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}
So you can imagine that for img@src, just replace each a with img, and href with src.
You might even be able to simplify to:
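// list is assumed to be a List<string> declared earlier
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a/@href | //img/@src"))
{
    // each node has either href (a) or src (img); use whichever is present
    HtmlAttribute href = node.Attributes["href"];
    HtmlAttribute src = node.Attributes["src"];
    list.Add((href ?? src).Value);
}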
For relative URL handling, look at the Uri class.
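As a rough sketch of what that could look like: once you have collected the raw href/src strings, you can resolve each one against the address of the page it came from. The LinkResolver class, the ToAbsolute method, and the example.com address are made up for illustration; only Uri and Uri.TryCreate come from the framework. It sticks to C# 2.0 syntax so it should compile in Visual Studio 2005.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the HTML Agility Pack.
public static class LinkResolver
{
    // Resolves each raw href/src value against the URL of the page it was found on.
    public static List<string> ToAbsolute(string pageUrl, IEnumerable<string> rawLinks)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> result = new List<string>();
        foreach (string raw in rawLinks)
        {
            Uri absolute;
            // TryCreate returns false instead of throwing when the value cannot be combined;
            // such links are skipped here.
            if (Uri.TryCreate(baseUri, raw, out absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}
```

For example, with a page at http://example.com/docs/page.htm, the value images/logo.png resolves to http://example.com/docs/images/logo.png, and a link that is already absolute comes back unchanged.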