
Looking for a good HTML parser that will provide offsetHeight-like values


I have a project which requires me to load an HTML document as a string and parse it. I am trying to determine which HTML node will exceed the height of a page (8.5 x 11 inches) so I can insert a ‘page-break-after’ element before it. This will be done in a .NET DLL I am producing.

I have tried using the mshtml DOM. It’s not easy to load a string value into it, and when I did manage to do so, the offsetHeight (and related) properties always returned zero. The only way I have found to make this work is to save the HTML to disk, load it via SHDocVw.InternetExplorer, and then pass that to the mshtml DOM.

I’m assuming that unless the HTML is ‘rendered’ by SHDocVw, I have no offsetHeight information for mshtml to report, as this is based on screen pixels. I could be wrong.

My current code is as follows:

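Dim myIE As New SHDocVw.InternetExplorer
myIE.Navigate("D:\Temp\Test.HTML")

' Navigate is asynchronous, so wait until the page has finished loading.
Do While myIE.Busy OrElse myIE.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE
    System.Threading.Thread.Sleep(50)
Loop

Dim myDoc As mshtml.HTMLDocument = CType(myIE.Document, mshtml.HTMLDocument)

Dim divTag As mshtml.IHTMLElement = myDoc.getElementById("someID")

' Insert a page break before each child whose bottom edge passes the page height.
For Each childNode As mshtml.IHTMLElement In TryCast(divTag.children, mshtml.IHTMLElementCollection)
    If childNode.offsetTop + childNode.offsetHeight > 750 Then ' Assumes 72 pixels = 1 inch.
        childNode.insertAdjacentHTML("beforeBegin", "<DIV style='page-break-after:always'></DIV>")
    End If
Next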

I have two goals. #1 is key, #2 ideal.

1) Load the HTML from a string, and have the above code still work.

2) Ideally, find a .NET component that will do the same thing. I don’t like relying on COM components in .NET unless I have no choice.


Solution

  • WebBrowser (maybe, not sure) will take your HTML string and convert it to a navigable DOM. Reuse an existing HTML parser rather than reinventing one; you'll have more hair left at the end of your project.
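
As a rough, untested sketch of that suggestion, the code below assumes the Windows Forms WebBrowser control (System.Windows.Forms.WebBrowser) is what the answer means. It loads the HTML from a string through DocumentText, waits for DocumentCompleted, and then applies the same offsetTop + offsetHeight test as the question via HtmlElement.OffsetRectangle. The class name PageBreakInserter, its members, and the 816-pixel width (8.5 inches at 96 pixels per inch) are assumptions made for illustration and do not come from the original post. WebBrowser still wraps the same COM engine as mshtml, so this only removes the disk round trip, not the COM dependency.

```vb
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Threading
Imports System.Windows.Forms

' Hypothetical helper; none of these names come from the original post.
Public Class PageBreakInserter
    Private ReadOnly _html As String
    Private ReadOnly _containerId As String
    Private ReadOnly _pageHeightPx As Integer
    Private _result As String

    Public Sub New(html As String, containerId As String, pageHeightPx As Integer)
        _html = html
        _containerId = containerId
        _pageHeightPx = pageHeightPx
    End Sub

    ' WebBrowser wraps an ActiveX control, so it runs on its own STA thread.
    Public Function Run() As String
        Dim worker As New Thread(AddressOf RenderAndMark)
        worker.SetApartmentState(ApartmentState.STA)
        worker.Start()
        worker.Join()
        Return _result
    End Function

    Private Sub RenderAndMark()
        Using browser As New WebBrowser()
            browser.ScrollBarsEnabled = False
            browser.Width = 816 ' Assumption: 8.5 in at 96 pixels per inch.
            AddHandler browser.DocumentCompleted, AddressOf OnDocumentCompleted
            browser.DocumentText = _html ' Loads the string; no file on disk.
            Application.Run() ' Pump messages until ExitThread is called.
        End Using
    End Sub

    Private Sub OnDocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs)
        Dim doc As HtmlDocument = DirectCast(sender, WebBrowser).Document
        Dim container As HtmlElement = doc.GetElementById(_containerId)
        If container IsNot Nothing Then
            ' Snapshot the children so inserting siblings does not disturb the loop.
            Dim children As New List(Of HtmlElement)
            For Each child As HtmlElement In container.Children
                children.Add(child)
            Next
            For Each child As HtmlElement In children
                ' OffsetRectangle holds offsetLeft/Top/Width/Height; Bottom = Top + Height.
                Dim box As Rectangle = child.OffsetRectangle
                If box.Bottom > _pageHeightPx Then
                    Dim breaker As HtmlElement = doc.CreateElement("DIV")
                    breaker.Style = "page-break-after:always"
                    child.InsertAdjacentElement(HtmlElementInsertionOrientation.BeforeBegin, breaker)
                End If
            Next
            _result = doc.Body.OuterHtml ' Markup of the body, including the inserted DIVs.
        End If
        Application.ExitThread() ' End the message loop started in RenderAndMark.
    End Sub
End Class
```

A call might look like `Dim paged As String = New PageBreakInserter(htmlText, "someID", 750).Run()`, where htmlText is the HTML string. Run returns Nothing if no element with that ID is found. Whether the offsets are populated when the control is not hosted on a visible form is something to check; if they come back as zero, placing the control on a form may be necessary.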