Tags: c++, winapi, html-parsing, ole, mshtml

Obtaining visible text on a page from an IHTMLDocument2*


I am trying to obtain the text content of an Internet Explorer web browser window.

I am following these steps:

  1. Obtain a pointer to IHTMLDocument2.
  2. From the IHTMLDocument2, obtain the body as an IHTMLElement.
  3. Call get_innerText on the body.

Edit


  1. I obtain all the children of the body and recurse into each IHTMLElement.
  2. If an element is not visible, or if its tag is script, I ignore that element and all of its children.

My problem is:

  1. Along with the text that is visible on the page, I also get the content of elements that have style="display: none".
  2. For google.com, I also get JavaScript along with the text.

I have tried a recursive approach, but I do not know how to deal with scenarios like this:

<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>

In this scenario I won't be able to get "Hello World 1".

Can anyone please help me find the best way to obtain the visible text from an IHTMLDocument2*? I am using C++ with plain Win32, without MFC or ATL.

Thanks, Ashish.


Solution

  • If you iterate backwards over the document.body.all elements, you always visit the elements from the inside out, so you don't need to recurse yourself; the DOM does that for you. For example (the code is in Delphi):

    procedure Test();
    var
      document, el: OleVariant;
      i: Integer;
    begin
      document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
      document.open;
      document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
      document.close;
      for i := document.body.all.length - 1 downto 0 do // iterate backwards
      begin
        el := document.body.all.item(i);
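        // filter the elements; note that style.display reflects only the inline
        // style attribute (currentStyle.display also covers stylesheet rules)
        if (el.style.display = 'none') then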
        begin
          el.removeNode(true);
        end;
      end;
      ShowMessage(document.body.innerText);
    end;
    

    A side comment on your scenario with the recursive approach:

    <div>Hello World 1<div style="display: none">Hello world 2</div></div>
    

    If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') returns "Hello World 1". So we could probably iterate forward over the elements and collect getAdjacentText('afterBegin') for each one, but this is a bit more difficult because we need to check el.currentStyle.display on each element and on all of its parents.
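
The forward pass described in the side comment can be modelled in portable C++, outside of MSHTML. This is only a rough sketch of the logic, not working COM code. Everything here is hypothetical: the `Element` struct stands in for one entry of document.body.all, its `displayNone` flag stands in for currentStyle.display being 'none', and its `afterBegin` field stands in for the result of getAdjacentText('afterBegin'). The function names `isSuppressed`, `collectVisibleText` and `sampleDocument` are made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical flat model of document.body.all: elements in document order.
struct Element {
    int parent;              // index of the parent element, -1 for a top-level element
    std::string tag;         // lower-case tag name
    bool displayNone;        // stands in for currentStyle.display == "none"
    std::string afterBegin;  // stands in for getAdjacentText("afterBegin")
};

// Returns true if the element itself, or any of its ancestors,
// is hidden or is a <script> element.
bool isSuppressed(const std::vector<Element>& all, int index) {
    for (int i = index; i != -1; i = all[i].parent) {
        if (all[i].displayNone || all[i].tag == "script") {
            return true;
        }
    }
    return false;
}

// Iterates forward over the elements and joins the leading text of every
// element that is not suppressed, one piece of text per line.
std::string collectVisibleText(const std::vector<Element>& all) {
    std::string out;
    for (int i = 0; i < static_cast<int>(all.size()); ++i) {
        if (all[i].afterBegin.empty() || isSuppressed(all, i)) {
            continue;
        }
        if (!out.empty()) {
            out += '\n';
        }
        out += all[i].afterBegin;
    }
    return out;
}

// The nested DIVs from the Delphi example, plus a script element.
std::vector<Element> sampleDocument() {
    return {
        {-1, "div", false, "Hello World 1"},
        {0, "div", true, "Hello world 2"},
        {1, "div", false, "This DIV is also invisible"},
        {-1, "script", false, "var x = 1;"},
    };
}
```

With this model, `collectVisibleText(sampleDocument())` yields only "Hello World 1": the second DIV is hidden itself, the third has a hidden parent, and the script element is skipped by tag. Note that the sketch only collects text that directly follows an opening tag; text that follows a nested child element would presumably need a similar pass with a different position argument.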