Search code examples
c#dompdfwebbrowser-controlabcpdf

Windows Forms WebBrowser control: DocumentText vs Document.Body.OuterHtml


I am trying to obtain html from the WebBrowser control, but it must include the value attributes of input elements on the page as well.

If I use webBrowser.DocumentText, I get the full HTML of the page as it was initially loaded. The input field values are not included.

If I use webBrowser.Document.Body.OuterHtml, I get the values, but not the other contents of (), which I need so I can get the stylesheet links, etc.

Is there a clean dependable way to obtain the full HTML of the DOM in its current state from the WebBrowser? I am passing the HTML to a library for it to be rendered to PDF, so suggestions for programmatically saving from the WebBrowser control to PDF will also be appreciated.

Thanks


Solution

  • There are some undocumented ways (changing registry, undocumented dll export) to print the document to XPS or PDF printers without parsing the page, that is, if your can afford to roll out required printer drivers to your customer's network.

    If you want to parse the web page, documentElement.outerHTML should give you the full canonicalized document, but not the linked image, script or stylesheet files. You need to parse the page, enumerate elements and check element types and get resource urls before digging the WinInet cache or downloading for additional resources. To get the documentElement property, you need to cast HtmlDocument.DomDocument to mshtml.IHTMLDocument2 if you use Windows Forms, or cast WebBrowser.Document to mshtml.IHTMLDocument2 if you use WPF. If you need to wait before the Ajax code finishes execution, starting a timer when the DocumentComplete event is raised.

    At this stage, I would parse the HTML DOM and get the necessary data in order to generate a report via a template, so you always have the option to generate other formats supported by the report engine, such as Microsoft Word. Very rarely I need to render the HTML as parsed, for example, printing a long table without adding customized header and footer on each page. That said, you can check Convert HTML to PDF in .NET and test which one of the suggested software/components works best with your target web site, if you do not have long tables.