delphi, webbrowser-control, c++builder, mshtml

Removing meta generator MSHTML


When I read the contents of an HTML page generated with TWebBrowser (in design mode), for example using this code:

function GetHTML(w: TWebBrowser): String;
Var
  e: IHTMLElement;
begin
  Result := '';
  if Assigned(w.Document) then
  begin
     e := (w.Document as IHTMLDocument2).body;

     while e.parentElement <> nil do
     begin
       e := e.parentElement;
     end;

     Result := e.outerHTML;
  end;
end;

It adds the META tag just before the </HEAD>, for example:

<META content="MSHTML 6.00.2900.2180" name=GENERATOR>

or...

<META name=GENERATOR content="MSHTML 11.00.10570.1001">

Is there a way to get rid of the tag when reading outerHTML?

Or prevent MSHTML from generating it in the first place?

Or some other method to get rid of it?


Solution

  • As @Remy Lebeau has indicated, you can't control this behaviour AFAIK. However, it's easy to get rid of the tag if you want to.

    Personally I would use regular expressions (System.RegularExpressionsCore), which implements Perl Compatible Regular Expressions (PCRE). It has certainly been available in the last several versions, but I don't know when it was introduced.

    You will want to use a RegEx setting of something like:

      <META[^>]*GENERATOR\s*>
    

    which matches any string that starts with <META, contains no > in between, and ends with GENERATOR followed by zero or more spaces and a closing >. You can set options for multi-line and case-insensitive matching. Set the replacement string to an empty string, and then your code (I've used C++ as you tagged with C++ Builder) will look something like:

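    #include <System.RegularExpressionsCore.hpp>

    // szOuterHTML is the UnicodeString read from outerHTML.
    TPerlRegEx *pRegEx = new TPerlRegEx();
    pRegEx->Replacement = UnicodeString(L"");
    pRegEx->RegEx = UnicodeString(L"<META[^>]*GENERATOR\\s*>");
    pRegEx->Options = TPerlRegExOptions() << preCaseLess << preMultiLine;
    pRegEx->Subject = szOuterHTML;
    pRegEx->ReplaceAll();
    // ReplaceAll() modifies Subject in place, so copy the result back.
    szOuterHTML = pRegEx->Subject;
    delete pRegEx;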
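
    One thing to watch: the pattern above only matches when GENERATOR is the last thing before the closing >, as in the MSHTML 6 example from the question. The MSHTML 11 example puts name=GENERATOR first, so it would slip through. Below is a minimal sketch of a slightly broader pattern that keys on the name attribute and allows other attributes on either side. To keep it self-contained it uses the standard library's std::regex rather than TPerlRegEx, and StripGeneratorMeta is a hypothetical helper name, not something from the VCL. The same pattern text should carry over to TPerlRegEx, but I have not verified that.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original answer or the VCL):
// removes every <META ...> tag whose name attribute is GENERATOR,
// regardless of where that attribute sits inside the tag.
std::string StripGeneratorMeta(const std::string& html)
{
    // <META, any attributes without '>', then name=GENERATOR (optionally
    // quoted), then any remaining attributes up to the closing '>'.
    // Assumption: no attribute value contains a literal '>'.
    static const std::regex generatorMeta(
        R"(<META\b[^>]*\bname\s*=\s*["']?GENERATOR\b[^>]*>)",
        std::regex::ECMAScript | std::regex::icase);

    // Replace every match with an empty string.
    return std::regex_replace(html, generatorMeta, "");
}
```

    Passing either of the two META lines from the question through StripGeneratorMeta removes the whole tag and leaves the surrounding markup untouched. Keying on name= rather than on the bare word GENERATOR avoids deleting an unrelated META tag that merely mentions "generator" in its content.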
    

    Of course there are other ways to do it, such as using an XML node parser to remove the node, but I think a RegEx is clean and simple. It's a great tool when processing text files.

    If you Google Regular Expression Syntax you should find some good resources including online checkers to test if your expression is doing what you think it should.