HTML source code from TWebBrowser - How to detect Stream encoding?


Based on this question: How can I get HTML source code from TWebBrowser

If I run this code on an HTML page that uses a Unicode code page, the result is gibberish because TStringStream is not Unicode-aware in D7. The page might be UTF-8 encoded or use some other (ANSI) code page.

How can I detect whether a TStream/IPersistStreamInit holds Unicode, UTF-8 or ANSI data?

How do I always return the correct result as a WideString from this function?

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, everything is fine: the file can be Unicode, UTF-8 or ANSI. But I always want to return the stream contents as a WideString:

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
  // LStream: TStringStream;
  LStream: TMemoryStream;
  Stream : IStream;
  LPersistStreamInit : IPersistStreamInit;
begin
  if not Assigned(WebBrowser.Document) then exit;
  // LStream := TStringStream.Create('');
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream,soReference);
    LPersistStreamInit.Save(Stream,true);
    // result := LStream.DataString;
    LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
    Result := ??? // WideString
  finally
    LStream.Free();
  end;
end;

EDIT: I found this article - How to load and save documents in TWebBrowser in a Delphi-like way

It does exactly what I need, but it works correctly only with the Unicode Delphi compilers (D2009+). See its Conclusion section:

There is obviously a lot more we could do. A couple of things immediately spring to mind. We retro-fit some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code when compiled with anything earlier than Delphi 2009 will not save document content to strings correctly if the document character set is not ANSI.

The magic is obviously in the TEncoding class (TEncoding.GetBufferEncoding), but D7 does not have TEncoding. Any ideas?
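
For context, TEncoding.GetBufferEncoding essentially inspects the byte order mark (BOM) at the start of the buffer. A minimal Delphi 7 sketch of that check could look like the code below. The names DetectBomCodePage and BomCodePageOfBytes, and the locally declared code page constants, are hypothetical and only for illustration. It assumes the saved stream actually begins with a BOM, which is not guaranteed; a page without one returns 0 and still needs its declared charset.

```pascal
program BomSniffDemo;
{$APPTYPE CONSOLE}

uses
  Classes;

const
  // Windows code page numbers, declared locally so the sketch
  // depends on nothing but the Classes unit.
  CP_UTF16_LE = 1200;
  CP_UTF16_BE = 1201;
  CP_UTF_8    = 65001;

// Returns the code page implied by a BOM at the start of Stream,
// or 0 when no BOM is present. BomSize receives the number of BOM
// bytes to skip. The stream position is restored before returning.
function DetectBomCodePage(Stream: TStream; out BomSize: Integer): Word;
var
  Bom: array[0..2] of Byte;
  Count: Integer;
  SavedPos: Int64;
begin
  Result := 0;
  BomSize := 0;
  SavedPos := Stream.Position;
  try
    Stream.Position := 0;
    Count := Stream.Read(Bom, SizeOf(Bom));
    if (Count >= 3) and (Bom[0] = $EF) and (Bom[1] = $BB) and (Bom[2] = $BF) then
    begin
      Result := CP_UTF_8;
      BomSize := 3;
    end
    else if (Count >= 2) and (Bom[0] = $FF) and (Bom[1] = $FE) then
    begin
      Result := CP_UTF16_LE;
      BomSize := 2;
    end
    else if (Count >= 2) and (Bom[0] = $FE) and (Bom[1] = $FF) then
    begin
      Result := CP_UTF16_BE;
      BomSize := 2;
    end;
  finally
    Stream.Position := SavedPos;
  end;
end;

// Demo helper (hypothetical): wraps raw bytes in a memory stream
// and reports the code page detected from their BOM.
function BomCodePageOfBytes(const Bytes: AnsiString): Word;
var
  MS: TMemoryStream;
  Skip: Integer;
begin
  MS := TMemoryStream.Create;
  try
    if Length(Bytes) > 0 then
      MS.WriteBuffer(Bytes[1], Length(Bytes));
    Result := DetectBomCodePage(MS, Skip);
  finally
    MS.Free;
  end;
end;

begin
  WriteLn(BomCodePageOfBytes(#$EF#$BB#$BF'<html>')); // UTF-8 BOM
  WriteLn(BomCodePageOfBytes(#$FF#$FE'<'#0));        // UTF-16 LE BOM
  WriteLn(BomCodePageOfBytes('<html>'));             // no BOM
end.
```

A BOM only covers the Unicode cases. It cannot tell one ANSI code page from another, so the charset reported by the document is still needed for those.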


Solution

  • I used GpTextStream to handle the conversion (should work for all Delphi versions):

    function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
    const
      WIN_CHARSET = 'windows-';
      ISO_CHARSET = 'iso-';
    var
      S: string;
    begin
      Result := 0;
      if Charset = 'unicode' then
        Result := CP_UNICODE else
      if Charset = 'utf-8' then
        Result := CP_UTF8 else
      if Pos(WIN_CHARSET, Charset) <> 0 then
      begin
        S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
        Result := StrToIntDef(S, 0);
      end else
      if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859-n maps to code page 28590 + n
      begin
        S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
        S := Copy(S, Pos('-', S) + 1, 2); // part number, e.g. '1' or '15'
        // iso-8859-1 => 28591, iso-8859-13 => 28603, iso-8859-15 => 28605
        if StrToIntDef(S, 0) > 0 then
          Result := 28590 + StrToIntDef(S, 0);
      end;
    end;
    
    function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
    var
      LStream: TMemoryStream;
      Stream: IStream;
      LPersistStreamInit: IPersistStreamInit;
      TextStream: TGpTextStream;
      Charset: WideString;
      Buf: WideString;
      CodePage: Word;
      N: Integer;
    begin
      Result := ''; 
      if not Assigned(WebBrowser.Document) then Exit;
      LStream := TMemoryStream.Create;
      try
        LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
        Stream := TStreamAdapter.Create(LStream, soReference);
        if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
        Charset := (WebBrowser.Document as IHTMLDocument2).charset;
        CodePage := GetCodePageFromHTMLCharSet(Charset);
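        // The byte count is an upper bound on the number of WideChars.
        N := LStream.Size;
        SetLength(Buf, N);
        // TGpTextStream converts the raw bytes from CodePage to UTF-16 on read.
        TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
        try
          // Read counts in bytes, so convert to and from WideChar counts.
          N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
          SetLength(Buf, N); // trim to the characters actually read
          Result := Buf;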
        finally
          TextStream.Free;
        end;
      finally
        LStream.Free();
      end;
    end;
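
If pulling in GpTextStream is not an option, the conversion step can probably also be done with the Win32 MultiByteToWideChar function from the Windows unit. This is an alternative sketch, not the method used above. BytesToWideString and the UTF8_CODE_PAGE constant are hypothetical names, and the code assumes the bytes have already been copied out of the stream and that any BOM has been skipped.

```pascal
program DecodeDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  UTF8_CODE_PAGE = 65001; // declared locally for the demo

// Converts Size bytes at Data from the given Windows code page to a
// WideString. Returns '' for empty input or when the API call fails.
function BytesToWideString(Data: PAnsiChar; Size: Integer;
  CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if (Data = nil) or (Size <= 0) then Exit;
  // First call: ask how many WideChars the converted text needs.
  Len := MultiByteToWideChar(CodePage, 0, Data, Size, nil, 0);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer.
  if MultiByteToWideChar(CodePage, 0, Data, Size, PWideChar(Result), Len) = 0 then
    Result := '';
end;

var
  Utf8Bytes: AnsiString;
  W: WideString;
begin
  Utf8Bytes := 'caf'#$C3#$A9; // "café" encoded as UTF-8 (5 bytes)
  W := BytesToWideString(PAnsiChar(Utf8Bytes), Length(Utf8Bytes), UTF8_CODE_PAGE);
  WriteLn(Length(W)); // number of WideChars after decoding
end.
```

Inside GetWebBrowserHTML this would be called with LStream.Memory, LStream.Size and the code page from GetCodePageFromHTMLCharSet. MultiByteToWideChar does not accept code page 1200, so when the stream is already UTF-16 the bytes should be copied into the WideString directly instead.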