I am trying to obtain the text content of an Internet Explorer web browser window.
I am following these steps:
Edit
My problem is
I have tried a recursive approach, but I am clueless as to how to deal with scenarios like this:
<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>
In this scenario I won't be able to get "Hello World 1".
Can anyone please help me with the best way to obtain the text from an IHTMLDocument2*? I am using C++ Win32, with no MFC or ATL.
Thanks, Ashish.
If you iterate backwards over the document.body.all elements, you always visit them inside out: every descendant comes before its ancestors. So you don't need to recurse yourself; the DOM does that for you. For example (the code is in Delphi):
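// Needs ComObj (CreateComObject), MSHTML (CLASS_HTMLDocument) and Dialogs (ShowMessage) in the uses clause.
procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  // Build an in-memory document from the sample markup.
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  // Iterate backwards so that descendants are visited before their ancestors.
  for i := document.body.all.length - 1 downto 0 do
  begin
    el := document.body.all.item(i);
    // Filter the elements: drop hidden ones together with their subtree.
    // Note: style.display only reflects the inline style attribute.
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  // Only the visible text is left in the document.
  ShowMessage(document.body.innerText);
end;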
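Since you are working in C++, here is a minimal, portable sketch of the same backwards walk. It does not touch MSHTML at all: `Node`, `element`, `textNode`, `collectAll`, `innerText` and `visibleText` are hypothetical names that only model the DOM, so the logic can be compiled and checked anywhere. In real code the corresponding steps would presumably go through `IHTMLElementCollection::item`, `IHTMLStyle::get_display`, `IHTMLDOMNode::removeNode` and `IHTMLElement::get_innerText`.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT an MSHTML type.
struct Node {
    bool isText = false;     // true for text nodes, false for elements
    std::string text;        // content of a text node
    bool hidden = false;     // models style.display == "none"
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;

    Node* add(std::unique_ptr<Node> child) {
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }
};

std::unique_ptr<Node> element(bool hidden = false) {
    auto n = std::make_unique<Node>();
    n->hidden = hidden;
    return n;
}

std::unique_ptr<Node> textNode(std::string s) {
    auto n = std::make_unique<Node>();
    n->isText = true;
    n->text = std::move(s);
    return n;
}

// Element descendants in document order, like body.all.
void collectAll(Node& root, std::vector<Node*>& out) {
    for (auto& child : root.children) {
        if (!child->isText) {
            out.push_back(child.get());
            collectAll(*child, out);
        }
    }
}

// Concatenated text of a subtree; a rough model of innerText
// that ignores whitespace and layout rules.
std::string innerText(const Node& n) {
    if (n.isText) return n.text;
    std::string s;
    for (const auto& child : n.children) s += innerText(*child);
    return s;
}

// Remove hidden elements while walking the collection backwards,
// then read the remaining text.
std::string visibleText(Node& body) {
    std::vector<Node*> all;
    collectAll(body, all);
    for (auto it = all.rbegin(); it != all.rend(); ++it) {
        Node* el = *it;
        if (!el->hidden) continue;
        assert(el->parent != nullptr);  // every entry is a descendant of body
        auto& siblings = el->parent->children;
        // Erasing the unique_ptr destroys the whole subtree. That is safe
        // here because its descendants were already visited.
        siblings.erase(std::find_if(siblings.begin(), siblings.end(),
            [el](const std::unique_ptr<Node>& p) { return p.get() == el; }));
    }
    return innerText(body);
}
```

The backwards order matters in this model for the same reason as in the Delphi code: removing a hidden element takes its subtree with it, and walking from the end guarantees that nothing still waiting to be visited has already been removed.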
A side comment on your scenario with the recursive approach:
<div>Hello World 1<div style="display: none">Hello world 2</div></div>
If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') will return "Hello World 1". So we could probably iterate forward over the elements and collect getAdjacentText('afterBegin') from each one, but this is a bit more difficult because we need to test the parents of each element for el.currentStyle.display, since a hidden parent hides the element as well.
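A rough sketch of that forward variant, again as a portable C++ model rather than real MSHTML code. `Element`, `addChild`, `isRendered` and `collectVisibleText` are hypothetical names; `leadingText` stands in for what getAdjacentText('afterBegin') would return, and `displayNone` for a currentStyle.display of 'none' on that element itself.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an element; this is NOT an MSHTML type.
struct Element {
    std::string leadingText;    // models getAdjacentText("afterBegin")
    bool displayNone = false;   // models currentStyle.display == "none"
    Element* parent = nullptr;
    std::vector<std::unique_ptr<Element>> children;

    Element* addChild(std::string text, bool hide = false) {
        auto child = std::make_unique<Element>();
        child->leadingText = std::move(text);
        child->displayNone = hide;
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// An element is rendered only if neither it nor any ancestor is hidden,
// so the whole parent chain has to be tested.
bool isRendered(const Element* el) {
    for (; el != nullptr; el = el->parent) {
        if (el->displayNone) return false;
    }
    return true;
}

// Visit elements in document order (the order body.all would yield them)
// and append the leading text of every rendered element.
void collectVisibleText(const Element& el, std::string& out) {
    if (isRendered(&el)) out += el.leadingText;
    for (const auto& child : el.children) collectVisibleText(*child, out);
}
```

This model only covers the text directly after each opening tag. Text that follows a child element is not collected, which is part of why this route is more work than removing the hidden nodes and reading innerText.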