HTML Agility Pack errors


I am trying out the HTML Agility Pack for the first time, using a sample snippet to pull the URLs out of some HTML. I am getting an error and I cannot tell why. Can someone point out to me what I am doing wrong?

Here is the source (html is an incoming string of HTML):

 StringBuilder sb = new StringBuilder();

 HtmlDocument htmldoc = new HtmlDocument();
 htmldoc.LoadHtml(html);

 foreach (HtmlNode link in htmldoc.DocumentNode.SelectNodes("//a[@HREF]"))
 {
     HtmlAttribute att = link.Attributes["HREF"];
     sb.AppendLine(att.Value + "|");
 }
 return sb.ToString();

I receive the following error when I debug my app (the debugger stops right after the "foreach"):

System.NullReferenceException was unhandled
  Message=Object reference not set to an instance of an object.
  Source=ScreenScraper
  StackTrace:
       at ScreenScraper.its.GetITSLoadID(String html) in C:\Web_Projects\ScreenScaper\ScreenScraper\its.cs:line 22
       at ScreenScraper.frm1.btnStartScraping_Click(Object sender, EventArgs e) in C:\Web_Projects\ScreenScaper\ScreenScraper\frm1.cs:line 43
       at System.Windows.Forms.Control.OnClick(EventArgs e)
       at System.Windows.Forms.Button.OnClick(EventArgs e)
       at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
       at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
       at System.Windows.Forms.Control.WndProc(Message& m)
       at System.Windows.Forms.ButtonBase.WndProc(Message& m)
       at System.Windows.Forms.Button.WndProc(Message& m)
       at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
       at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
       at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
       at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
       at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
       at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
       at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
       at System.Windows.Forms.Application.Run(Form mainForm)
       at ScreenScraper.Program.Main() in C:\Web_Projects\ScreenScaper\ScreenScraper\Program.cs:line 18
       at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.AppDomain.nExecuteAssembly(RuntimeAssembly assembly, String[] args)
       at System.Runtime.Hosting.ManifestRunner.Run(Boolean checkAptModel)
       at System.Runtime.Hosting.ManifestRunner.ExecuteAsAssembly()
       at System.Runtime.Hosting.ApplicationActivator.CreateInstance(ActivationContext activationContext, String[] activationCustomData)
       at System.Runtime.Hosting.ApplicationActivator.CreateInstance(ActivationContext activationContext)
       at System.Activator.CreateInstance(ActivationContext activationContext)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssemblyDebugInZone()
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException: 

Solution

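  • The Html Agility Pack has a "design bug": SelectNodes returns null, not an empty collection, when no nodes match the XPath expression. The foreach then tries to enumerate null, which throws the NullReferenceException. So you need to check for null first: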

    HtmlNodeCollection list = htmldoc.DocumentNode.SelectNodes("//a[@HREF]");
    if (list != null)
    {
      foreach (HtmlNode link in list)
      ...
    }
    

    And by the way, that is also why nothing matched here: element and attribute names in the XPath expression must be lowercase, even if they are written differently in the HTML text. HTML is case insensitive, so the Html Agility Pack normalizes names to lowercase, while XPath matching itself is case sensitive. So you should write this instead:

    HtmlNodeCollection list = htmldoc.DocumentNode.SelectNodes("//a[@href]");
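
    Putting both fixes together, the method from the question could look roughly like the sketch below. The class and method names (LinkExtractor, ExtractLinks) are made up for illustration, and returning an empty string when there are no links is an assumption about what the caller wants, not something the question specifies.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns each href value followed by "|", one per line.
    public static string ExtractLinks(string html)
    {
        var sb = new StringBuilder();

        var htmldoc = new HtmlDocument();
        htmldoc.LoadHtml(html);

        // Lowercase names in the XPath; SelectNodes yields null when nothing matches.
        HtmlNodeCollection links = htmldoc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            // Assumption: an empty result is acceptable when the page has no links.
            return string.Empty;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the given default if the attribute is absent.
            sb.AppendLine(link.GetAttributeValue("href", string.Empty) + "|");
        }

        return sb.ToString();
    }
}
```

    Calling LinkExtractor.ExtractLinks(html) on markup with no anchors then returns an empty string instead of throwing.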