Reading HTML content from Clipboard in Delphi


I have a webpage with various tables on it. These tables are JavaScript components, not just pure HTML tables. I need to process the text of this webpage (somewhat similar to screen scraping) with a Delphi program (Delphi 10.3).

I do a Ctrl-A/Ctrl-C to select the whole webpage and copy everything to the clipboard. If I paste this into a TMemo component in my program, I only get the text outside the tables. If I paste into MS Word, I can see all the content, including the text inside the tables.

I can paste this properly into TAdvRichEditor (3rd party), but it takes forever, and I often run out of memory. This leads me to believe that I need to directly read the clipboard with an HTML clipboard format.

I set up a clipboard HTML format. When I inspect the clipboard contents, I get what looks like all Kanji characters.

How do I read the contents of the clipboard when the contents are HTML?

In a perfect world, I would like ONLY the text, not the HTML itself, but I can strip that out later. Here is what I am doing now...

On initialization (where CF_HTML is a global variable):

CF_HTML := RegisterClipboardFormat('HTML Format');

Then my routine is:

function TMain.ClipboardAsHTML: String;
var
  Data: THandle;
  Ptr: PChar;
begin
  Result := '';
  with Clipboard do
  begin
    Open;
    try
      Data := GetAsHandle(CF_HTML);
      if Data <> 0 then
      begin
        Ptr := PChar(GlobalLock(Data));
        if Ptr <> nil then
        try
          Result := Ptr;
        finally
          GlobalUnlock(Data);
        end;
      end;
    finally
      Close;
    end;
  end;
end;

ADDITIONAL INFO: After I copy from the webpage, I can inspect the contents of the clipboard using a free tool called InsideClipBoard. It shows that the clipboard contains 1 entry with 5 formats: CF_TEXT, CF_OEMTEXT, CF_UNICODETEXT, CF_LOCALE, and 'HTML Format' (with a format ID of 49409). Only 'HTML Format' contains what I am looking for, and that is what I am trying to access with the code shown above.


Solution

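  • The HTML clipboard format is documented by Microsoft (see "HTML Clipboard Format"). It is placed on the clipboard as UTF-8 encoded text, not UTF-16. In Delphi 10.3, PChar is PWideChar, so reading the buffer through PChar treats each pair of UTF-8 bytes as one UTF-16 code unit, which is why the result looks like Kanji. Read the buffer as bytes and convert from UTF-8 instead, like this.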

    {$APPTYPE CONSOLE}
    
    uses
      System.SysUtils,
      Winapi.Windows,
      Vcl.Clipbrd;
    
    procedure Main;
    var
      CF_HTML: Word;
      Data: THandle;
      Ptr: Pointer;
      Error: DWORD;
      Size: NativeUInt;
      utf8: UTF8String;
      Html: string;
    begin
      CF_HTML := RegisterClipboardFormat('HTML Format');
    
      Clipboard.Open;
      try
        Data := Clipboard.GetAsHandle(CF_HTML);
        if Data=0 then begin
          Writeln('HTML data not found on clipboard');
          Exit;
        end;
    
        Ptr := GlobalLock(Data);
        if not Assigned(Ptr) then begin
          Error := GetLastError;
          Writeln('GlobalLock failed: ' + SysErrorMessage(Error));
          Exit;
        end;
        try
          Size := GlobalSize(Data);
          if Size=0 then begin
            Error := GetLastError;
            Writeln('GlobalSize failed: ' + SysErrorMessage(Error));
            Exit;
          end;
    
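          // The payload is UTF-8, so read it as bytes (PAnsiChar), not PChar.
          // Size - 1 drops the null terminator; GlobalSize may report more
          // than was allocated, so trailing #0 bytes can remain in the string.
          SetString(utf8, PAnsiChar(Ptr), Size - 1);
          // Convert UTF-8 to a native UTF-16 Delphi string.
          Html := string(utf8);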
          Writeln(Html);
        finally
          GlobalUnlock(Data);
        end;
      finally
        Clipboard.Close;
      end;
    end;
    
    begin
      try
        Main;
      except
        on E: Exception do
          Writeln(E.ClassName, ': ', E.Message);
      end;
      Readln;
    end.
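
The question mentions wanting only the copied content. The payload read above starts with a short text header (Version, StartHTML, EndHTML, StartFragment, EndFragment), and the StartFragment/EndFragment values are byte offsets into the UTF-8 data. Here is a rough sketch of cutting out just the fragment. HeaderOffset, ExtractFragment and SamplePayload are hypothetical helper names made up for this example, and the sample payload is synthetic rather than real browser output. Stripping the remaining tags is still a separate step.

```pascal
program HtmlFragmentSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the number following a header key such as
// 'StartFragment:' in a raw CF_HTML payload, or -1 if the key is missing.
function HeaderOffset(const Payload, Key: UTF8String): Integer;
var
  P, E: Integer;
begin
  Result := -1;
  P := Pos(Key, Payload);
  if P = 0 then
    Exit;
  Inc(P, Length(Key));
  E := P;
  while (E <= Length(Payload)) and (Payload[E] in ['0'..'9']) do
    Inc(E);
  Result := StrToIntDef(string(Copy(Payload, P, E - P)), -1);
end;

// Hypothetical helper: cuts the bytes between StartFragment and EndFragment
// and converts them to a native string. Returns '' if the header is unusable.
function ExtractFragment(const Payload: UTF8String): string;
var
  StartOfs, EndOfs: Integer;
  Frag: UTF8String;
begin
  Result := '';
  StartOfs := HeaderOffset(Payload, 'StartFragment:');
  EndOfs := HeaderOffset(Payload, 'EndFragment:');
  // Offsets are zero-based byte counts from the start of the payload,
  // so slice the UTF-8 bytes first and convert to UTF-16 afterwards.
  if (StartOfs < 0) or (EndOfs < StartOfs) or (EndOfs > Length(Payload)) then
    Exit;
  Frag := Copy(Payload, StartOfs + 1, EndOfs - StartOfs);
  Result := UTF8ToString(Frag);
end;

// Builds a small synthetic payload (an assumption for the demo only).
function SamplePayload: UTF8String;
const
  Hdr = 'Version:0.9'#13#10'StartHTML:%.10d'#13#10'EndHTML:%.10d'#13#10 +
        'StartFragment:%.10d'#13#10'EndFragment:%.10d'#13#10;
  Pre = '<html><body><!--StartFragment-->';
  Frag = '<b>Hi</b>';
  Post = '<!--EndFragment--></body></html>';
var
  H: Integer;
begin
  // Every offset is padded to 10 digits, so the header length is fixed.
  H := Length(Format(Hdr, [0, 0, 0, 0]));
  Result := UTF8String(Format(Hdr, [H, H + Length(Pre + Frag + Post),
    H + Length(Pre), H + Length(Pre + Frag)]) + Pre + Frag + Post);
end;

begin
  Writeln(ExtractFragment(SamplePayload));
end.
```

In the clipboard program you would pass the utf8 variable to ExtractFragment instead of converting the whole payload with string(utf8). The slicing has to happen on the UTF-8 bytes because the offsets stop lining up once the text is converted to UTF-16.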