Tags: c#, web-scraping, web-crawler

Any Good Open Source Web Crawling Framework in C#?


I am building a shopping comparison engine, and I need a crawling engine to perform the daily data collection.

I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes; in my use they have been highly buggy and unstable for large crawls, even in .NET Framework 4.0, so I have decided NOT to build on them.

I speak from my own personal experience.

I would like opinions from experts here who have written crawlers: do you know of any good open source crawling frameworks for C#, comparable to what Java has in Nutch and Apache Commons, which are very stable and highly robust libraries?

If crawling frameworks already exist in C#, I will build my application on top of one of them.

If not, I am planning to take this solution from CodeProject and extend it:

http://www.codeproject.com/KB/IP/Crawler.aspx

If anyone can suggest a better path, I would be really thankful.

EDIT: Some of the sites I have to crawl render their pages using very complex JavaScript. This adds more complexity to my crawler, since it needs to be able to crawl pages rendered by JavaScript. If someone has used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't like, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.


Solution

  • PhantomJS + HtmlAgilityPack

    I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.

    This example uses only PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; since it still seems to be under active development, it'd be safe to assume that even more capabilities have been added.

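    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    void Test()
    {
        var linkText = @"Help Spread DuckDuckGo!";
        // GoToUrl needs an absolute URL, so the scheme is included here.
        Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
        // as of right now, this would print 'https://duckduckgo.com/spread'
    }

    /// <summary>
    /// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
    /// its URL if found, otherwise an empty string.
    /// </summary>
    public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
    {
        using (IWebDriver phantom = new PhantomJSDriver())
        {
            // Navigate() is a method that returns the navigation interface.
            phantom.Navigate().GoToUrl(pageUrl);
            // FindElements returns an empty collection when nothing matches;
            // FindElement would throw NoSuchElementException instead of returning null.
            var links = phantom.FindElements(By.PartialLinkText(searchLinkText));
            if (links.Count > 0)
                return links[0].GetAttribute("href") ?? string.Empty;
        }
        return string.Empty;
    }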
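
    The example above uses only PhantomJS. As a rough sketch of the combination with HtmlAgilityPack mentioned earlier, the rendered markup can be handed over through the driver's PageSource property and queried with XPath. The class and method names here (RenderedPageScraper, ExtractLinks, FetchLinks) are made up for illustration, and the code assumes the Selenium PhantomJS driver and HtmlAgilityPack NuGet packages are installed.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    // Hypothetical helper class, not part of either library.
    public static class RenderedPageScraper
    {
        // Parses an HTML string with HtmlAgilityPack and returns a
        // (link text, href) pair for every anchor that has an href attribute.
        public static List<KeyValuePair<string, string>> ExtractLinks(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var result = new List<KeyValuePair<string, string>>();

            // SelectNodes returns null, not an empty collection, when nothing matches.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null)
                return result;

            foreach (var anchor in anchors)
            {
                // InnerText keeps HTML entities, so decode them before trimming.
                var text = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
                var href = anchor.GetAttributeValue("href", string.Empty);
                result.Add(new KeyValuePair<string, string>(text, href));
            }
            return result;
        }

        // Lets PhantomJS load the page and run its scripts, then passes the
        // resulting markup to ExtractLinks.
        public static List<KeyValuePair<string, string>> FetchLinks(string pageUrl)
        {
            using (IWebDriver phantom = new PhantomJSDriver())
            {
                phantom.Navigate().GoToUrl(pageUrl);
                return ExtractLinks(phantom.PageSource);
            }
        }
    }
    ```

    Calling RenderedPageScraper.FetchLinks("https://duckduckgo.com") and printing each pair's Key and Value would list the links on the rendered page. Keeping the parsing in ExtractLinks, separate from the browser call, means the scraping logic can be tried against saved HTML without starting PhantomJS. Pages that keep loading content after the initial load may need an explicit wait before PageSource is read.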