Tags: c#, web-scraping, html-agility-pack

Trouble Scraping .HTM File


I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores off rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL's game summary pages. I think this is an interesting problem, so I thought I would post it here.

The page I am testing is: http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM

At first glance, it looks like basic text with no AJAX or anything else that would trip up a basic scraper. Then I realized I couldn't right-click because of some JavaScript, so I worked around that. Right-clicking in Firefox and using XPather to get the XPath of the home team gives me:

/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td

When I try to grab that node or its inner text, HtmlAgilityPack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?

I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are gladly appreciated!

P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.


Solution

  • Unless my XPath knowledge is badly flawed (which is possible), I think the problem is the /tbody steps in your XPath expression.

    When I do

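    // Load the locally saved page; "using" closes the reader for us.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    using (StreamReader sr = new StreamReader(@"C:\gs.htm"))
    {
        doc.Load(sr);
    }
    // No tbody step here: go straight from the table to its rows.
    string xpath = @"//table[@id='Home']/tr[3]/td";
    HtmlAgilityPack.HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
    // SelectSingleNode returns null when nothing matches.
    string test = node != null ? node.InnerText : string.Empty;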
    

    That works fine. It returns
    "COLUMBUS BLUE JACKETSGame 5 Home Game 3"
    which I hope is the string you wanted.

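    Examining the HTML source, I couldn't find a tbody element anywhere. Browsers such as Firefox add tbody to the DOM when they parse a table that lacks one, so an XPath copied from a browser tool like XPather includes it, while HtmlAgilityPack works from the raw markup, where it doesn't exist.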
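
If you want to keep pasting paths from a browser tool, one option is to drop the tbody steps before handing the expression to HtmlAgilityPack. The following is only a minimal sketch: XPathCleanup and StripTbody are hypothetical names, not part of HtmlAgilityPack, and it assumes the page source really contains no tbody tags (if it does, stripping them would break the path instead).

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleanup
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes every
    // "/tbody" step, with or without a predicate such as "/tbody[1]".
    public static string StripTbody(string browserXPath)
    {
        return Regex.Replace(browserXPath, @"/tbody(\[[^\]]*\])?(?=/|$)", string.Empty);
    }

    public static void Main()
    {
        // The XPath from the question, as reported by XPather.
        string fromXPather =
            "/html/body/table[@id='MainTable']/tbody/tr[1]/td"
            + "/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]"
            + "/table[@id='Home']/tbody/tr[3]/td";

        // Prints the same path with the four tbody steps removed.
        Console.WriteLine(StripTbody(fromXPather));
    }
}
```

The cleaned string can then be passed to doc.DocumentNode.SelectSingleNode as in the snippet above. A long absolute path can still fail for other reasons, so the shorter //table[@id='Home']/tr[3]/td is probably the more robust choice.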