I wanted to use the HTMLDocument object from the mshtml library. I tried to assign HTML to the document:
var doc = new mshtml.HTMLDocument();
var html = File.ReadAllText(@"path_to_html_file");
doc.body.innerHTML = html; // <-- this line throws error
However, I get an error on the third line:
System.NullReferenceException: 'Object reference not set to an instance of an object.'
mshtml.DispHTMLDocument.body.get returned null.
I also tried dynamic code, but it didn't work either:
dynamic doc = Activator.CreateInstance(Type.GetTypeFromProgID("htmlfile"));
In this case I get the following error:
Microsoft.CSharp.RuntimeBinder.RuntimeBinderException: 'Cannot perform runtime binding on a null reference'
Is there a way to overcome this problem? Thanks!
For comparison, the equivalent VBA code runs without error:
Sub GetData()
    Dim doc As MSHTML.HTMLDocument
    Dim fso As FileSystemObject, txt As TextStream
    Set doc = New MSHTML.HTMLDocument
    Set fso = New FileSystemObject
    Set txt = fso.OpenTextFile("path_to_html_file")
    doc.body.innerHTML = txt.ReadAll() '// <-- No error here
    txt.Close
End Sub
You could cast the mshtml.HTMLDocument to the IHTMLDocument2 interface, to make the object's main properties and methods available:
var doc = (IHTMLDocument2)new mshtml.HTMLDocument();
Or create an HTMLDocumentClass instance using Activator.CreateInstance() with the type GUID, then cast it to the IHTMLDocument2 interface.
IHTMLDocument2 doc =
(IHTMLDocument2)Activator.CreateInstance(
Type.GetTypeFromCLSID(new Guid("25336920-03F9-11CF-8FD0-00AA00686F13")));
It's more or less the same thing. I'd prefer the first one, mainly for this reason.
Then you can write whatever you want to the HtmlDocument. For example:
doc.write(File.ReadAllText(@"[Some Html Page]"));
Console.WriteLine(doc.body.innerText);
To create an HtmlDocument, a skeleton HTML page is enough, something like this:
string html = "<!DOCTYPE html><html><head></head><body><p></p></body></html>";
doc.write(html);
Note: before a document is created, all elements in the page will be null, which is why doc.body returned null in the question's code.
Afterwards, you can set body.innerHTML to something else:
doc.body.innerHTML = "<P>Some Text</P>";
Console.WriteLine(doc.body.innerText);
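Putting the steps above together, here is a minimal sketch of loading a file into an in-memory document. It assumes a project reference to the Microsoft HTML Object Library (mshtml); the HtmlLoader class and its Parse method are hypothetical names used only for illustration, and the call to close() after write() is my own addition to end the write stream, not something the steps above require.

```csharp
using System;
using System.IO;
using mshtml; // assumes a COM reference to the Microsoft HTML Object Library

static class HtmlLoader
{
    // Hypothetical helper: parses an HTML string into an in-memory document.
    public static IHTMLDocument2 Parse(string html)
    {
        var doc = (IHTMLDocument2)new HTMLDocument();
        doc.write(html); // builds the document tree, so doc.body is no longer null
        doc.close();     // assumption: close the stream that write() opened
        return doc;
    }

    [STAThread] // MSHTML objects are apartment-threaded COM objects
    static void Main()
    {
        // Placeholder path, as in the question.
        string html = File.ReadAllText(@"path_to_html_file");
        IHTMLDocument2 doc = Parse(html);
        Console.WriteLine(doc.body.innerText);
    }
}
```

If the file holds only an HTML fragment, writing the skeleton page first and then assigning the fragment to doc.body.innerHTML should work in the same way.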
Note that if you need to work with the HTML document more extensively, you'll have to cast to a higher-level interface: IHTMLDocument3 to IHTMLDocument8 (as of now), depending on the system version.
The classic getElementById, getElementsByName and getElementsByTagName methods are available in the IHTMLDocument3 interface.
For example, use getElementsByTagName() to retrieve the innerText of an HTML element using its tag name:
string innerText =
(doc as IHTMLDocument3).getElementsByTagName("body")
.OfType<IHTMLElement>().First().innerText;
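The same cast gives access to getElementById. The following is only a sketch under the same assumption of an mshtml reference; the ElementLookup class, the "msg" id and the sample markup are made up for illustration.

```csharp
using System;
using mshtml; // assumes a COM reference to the Microsoft HTML Object Library

static class ElementLookup
{
    [STAThread] // MSHTML objects are apartment-threaded COM objects
    static void Main()
    {
        var doc = (IHTMLDocument2)new HTMLDocument();
        // Illustrative markup with a made-up id.
        doc.write("<!DOCTYPE html><html><head></head><body><p id=\"msg\">Some Text</p></body></html>");
        doc.close();

        // getElementById is declared on IHTMLDocument3, not on IHTMLDocument2.
        IHTMLElement element = ((IHTMLDocument3)doc).getElementById("msg");

        // Guard against a missing element before reading its text.
        Console.WriteLine(element != null ? element.innerText : "(not found)");
    }
}
```

An explicit cast is used here instead of `as` so that an unsupported interface fails with an InvalidCastException at the cast rather than a NullReferenceException later.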
Note: If you can't find the IHTMLDocument6, IHTMLDocument7 and IHTMLDocument8 interfaces (and possibly other interfaces referenced in the MSDN Docs), then you probably have an old type library in the \Windows\Assembly\GAC. Follow Hans Passant's advice to create a new Interop.mshtml library:
How to get mshtml.IHTMLDocument6 or mshtml.IHTMLDocument7?