
Looking for a free alternative to Webzinc .NET, screen scraping, web automation libraries for .NET


I came across this .NET library:

http://www.webzinc.com/online/faq.aspx

However, I was wondering if there was a free alternative out there?


Solution

  • Building robots isn't that hard, and there are a number of books that describe the general algorithm for doing so (a simple Google search will turn up a number of algorithms).

    The gist of it from a .NET perspective is to recursively:

    • Download pages - Use the HttpWebRequest/HttpWebResponse classes or the WebClient class. You can also use the new WCF Web API from CodePlex, which is a vast improvement over those. It is meant specifically for producing and consuming REST content, and it works wonderfully for spidering purposes, mainly because of its extensibility.

    • Parse the downloaded content - I highly recommend the Html Agility Pack, as well as the Fizzler extension for it. The Html Agility Pack handles malformed HTML and lets you query HTML elements using XPath (or a subset of it). Additionally, Fizzler lets you use CSS selectors, which will feel familiar if you have used them in jQuery.

    • Once you have the HTML in a structured format, scan the structure for the content that is relevant to you and process it.

      • Scan the structured format for external links and place them in the queue to be processed (subject to whatever constraints you want for your app; you aren't indexing the entire web, are you?).

      • Get the next item in the queue, and repeat the process again.
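
The loop described above can be sketched roughly as follows. This is only a minimal illustration, not a production crawler: it assumes the Html Agility Pack NuGet package is referenced, and the names Spider, ExtractLinks and Crawl, the example.com start URL, the 20-page limit and the "same host only" rule are all made up for the example. It also uses a queue instead of literal recursion, and it ignores things a real robot should respect, such as robots.txt and request throttling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be referenced)

public static class Spider
{
    // Parse the HTML and return every http(s) link, resolved against the page URL.
    public static IEnumerable<Uri> ExtractLinks(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates malformed markup

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) yield break;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", "");
            if (Uri.TryCreate(pageUri, href, out var link) &&
                (link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps))
            {
                yield return link;
            }
        }
    }

    // Download, process, collect links, repeat, until the queue is empty
    // or maxPages pages have been fetched.
    public static void Crawl(Uri start, int maxPages,
                             Func<Uri, string> download,
                             Action<Uri, string> process)
    {
        var queue = new Queue<Uri>();
        var seen = new HashSet<Uri> { start };
        queue.Enqueue(start);
        var fetched = 0;

        while (queue.Count > 0 && fetched < maxPages)
        {
            var url = queue.Dequeue();

            string html;
            try { html = download(url); }
            catch (WebException) { continue; } // skip pages that fail to download

            fetched++;
            process(url, html); // do whatever is relevant to your app here

            foreach (var link in ExtractLinks(html, url))
            {
                // Example constraint (an assumption): stay on the starting host.
                // seen.Add returns false for URLs that were already queued.
                if (link.Host == start.Host && seen.Add(link))
                    queue.Enqueue(link);
            }
        }
    }

    public static void Main()
    {
        using (var client = new WebClient())
        {
            Crawl(new Uri("http://example.com/"), 20,
                  url => client.DownloadString(url),
                  (url, html) => Console.WriteLine($"{url}: {html.Length} chars"));
        }
    }
}
```

The download step is passed in as a delegate so that WebClient can be swapped for HttpWebRequest or anything else without touching the loop. If you prefer CSS selectors, Fizzler's QuerySelectorAll("a[href]") extension method (from the Fizzler.Systems.HtmlAgilityPack namespace) should be able to take the place of the XPath query in ExtractLinks.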