Tags: c#, pdf, screen-scraping, html-parsing, html-content-extraction

Screen-scraping for PDF links to download


I'm learning C# by creating a small program, and couldn't find a similar post (apologies if this has been answered somewhere else).

How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page will link to another HTML page that contains the actual PDF link, so if no PDF can be found on the first page, I'd like the program to automatically look for a link that has "PDF" in its text, and then search the resulting HTML page for the real PDF link.

I know that I could probably achieve something similar via a filetype search through Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing using XElement and such, but I'm not sure how to do it for getting links from an HTML page (or other format?).

Could anyone point me in the right direction? Thanks!


Solution

  • HtmlAgilityPack is great for this kind of stuff.

    Example of implementation:

    // note: the LINQ query below needs "using System.Linq;" in scope
    string pdfLinksUrl = "http://www.google.com/search?q=filetype%3Apdf";
    
    // Load HTML content    
    var webGet = new HtmlAgilityPack.HtmlWeb();
    var doc = webGet.Load(pdfLinksUrl);
    
    // select all <A> nodes from the document using XPath
    // (unfortunately we can't select attribute nodes directly as
    // it is not yet supported by HAP)
    var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
    
    // select all href attribute values ending with '.pdf' (case-insensitive)
    var pdfUrls = from linkNode in linkNodes
        let href = linkNode.Attributes["href"].Value
        where href.ToLower().EndsWith(".pdf")
        select href;
    
    // write all PDF links to file
    System.IO.File.WriteAllLines(@"c:\pdflinks.txt", pdfUrls.ToArray());
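
    To cover the second half of the question (pages that only link to another HTML page which holds the real PDF link), here is a minimal sketch built on the same approach. The URL, the c:\pdfs folder, and the FindPdfLinks helper are all made up for illustration, and real code would want error handling around the network calls:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net;
    using HtmlAgilityPack;
    
    class PdfScraper
    {
        // Collect absolute URLs of ".pdf" links on a page. If none are found
        // and depth allows, follow anchors whose text mentions "PDF" and
        // repeat the search one level down.
        static IEnumerable<string> FindPdfLinks(string url, int depth)
        {
            var doc = new HtmlWeb().Load(url);
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null)   // SelectNodes returns null when nothing matches
                return Enumerable.Empty<string>();
    
            var pdfs = anchors
                .Select(a => a.GetAttributeValue("href", ""))
                .Where(h => h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                .Select(h => new Uri(new Uri(url), h).AbsoluteUri) // resolve relative hrefs
                .Distinct()
                .ToList();
    
            if (pdfs.Any() || depth <= 0)
                return pdfs;
    
            // fallback: follow links whose visible text mentions "PDF"
            return anchors
                .Where(a => a.InnerText.IndexOf("PDF", StringComparison.OrdinalIgnoreCase) >= 0)
                .Select(a => new Uri(new Uri(url), a.GetAttributeValue("href", "")).AbsoluteUri)
                .Distinct()
                .SelectMany(next => FindPdfLinks(next, depth - 1));
        }
    
        static void Main()
        {
            var targetDir = @"c:\pdfs";   // hypothetical download folder
            Directory.CreateDirectory(targetDir);
            using (var client = new WebClient())
            {
                foreach (var pdfUrl in FindPdfLinks("http://example.com/reports", 1))
                {
                    var fileName = Path.GetFileName(new Uri(pdfUrl).LocalPath);
                    client.DownloadFile(pdfUrl, Path.Combine(targetDir, fileName));
                }
            }
        }
    }

    Keeping the recursion behind a depth parameter stops the scraper from wandering off across the whole site.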
    

    As a side note, I would not rely too heavily on XPath expressions in HAP. Some XPath functions are missing, and putting all of the extraction logic inside your XPath makes the code harder to maintain. I would extract the bare minimum with an XPath expression, and then do all the required extraction by iterating through the node collection (the LINQ methods help a lot).
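
    For example, the query above could keep the XPath down to a bare //a and do the attribute filtering in LINQ instead (a sketch, reusing the doc variable from the earlier snippet):

    // minimal XPath, all filtering in LINQ (needs "using System.Linq;")
    var anchors = doc.DocumentNode.SelectNodes("//a") ?? new HtmlNodeCollection(null);
    var pdfUrls = anchors
        .Where(a => a.Attributes.Contains("href"))
        .Select(a => a.Attributes["href"].Value)
        .Where(h => h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));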

    The real power of HAP is its ability to parse SGML-style documents, that is, markup which can be invalid from the XHTML point of view (unclosed tags, missing quotes, etc.).
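
    A quick illustration of that tolerance (the markup here is deliberately broken; it is invalid as XHTML, but HAP still builds a DOM you can query):

    using System;
    using HtmlAgilityPack;
    
    class SgmlDemo
    {
        static void Main()
        {
            // unquoted attribute value and unclosed <p> tags -- invalid XHTML,
            // but HAP parses it without complaint
            var doc = new HtmlDocument();
            doc.LoadHtml("<div class=main><p>First paragraph<p>Second one");
            var div = doc.DocumentNode.SelectSingleNode("//div[@class='main']");
            Console.WriteLine(div.InnerText);   // First paragraphSecond one
        }
    }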