
Upload Word File to extract Text via TIKA REST


I am trying to call Apache Tika via its REST API. I have been able to upload a PDF document and get the document's text back via CURL:

curl -X PUT --data-binary @<filename>.pdf http://localhost:9998/tika --header "Content-type: application/pdf"

That translated to INDY like so:

function GetPDFText(const FileName: String): String;
var
  IdHTTP:  TIdHTTP;
  Params: TIdMultiPartFormDataStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    Params := TIdMultiPartFormDataStream.Create;
    try
      Params.Add('file', FileName, 'application/pdf');
      Result := IdHTTP.PUT('http://localhost:9998/tika', Params);
    finally
      Params.Free;
    end;    
  finally
    IdHTTP.Free;
  end;
end;

Now I want to upload a Word document (.docx). I assumed that all I would need to do is change the content type when I add my file to Params, but that doesn't produce any results, although no error is reported back. I was able to get the following CURL command to work correctly:

CURL -T <myDOCXfile>.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

How do I modify my HTTP call from CURL -X PUT to CURL -T?


Solution

  • There are at least two issues in your implementation:

    1. Your translation from CURL -X PUT to TIdHTTP is wrong.
    2. You don't specify an Accept HTTP header to request the extracted text in a specific format.

    How to translate curl -X PUT to Indy?

    First, let's make it clear that curl -X PUT --data-binary @<filename> <url> is the same as curl -T <filename> <url> when:

    • <url>'s scheme is HTTP or HTTPS
    • <url> does not end with /

    Therefore, using one or the other shouldn't matter in your case. See also the curl documentation.

    Secondly, TIdMultiPartFormDataStream is designed for use with the POST verb. Nothing stops you from passing it to TIdHTTP.Put, however, because it is indirectly derived from TStream. There is even a dedicated overload of the TIdHTTP.Post method that accepts TIdMultiPartFormDataStream:

    function Post(AURL: string; ASource: TIdMultiPartFormDataStream): string; overload;

    To upload a file to the service, use the TIdHTTP.Put method with a file stream as the argument (the implementation below uses Indy's TIdReadFileExclusiveStream), and set the proper content type of the uploaded file in the HTTP request header.

    Finally, you're trying to extract plain text from the document, but you didn't specify the content type that the service should return. This is done via the Accept HTTP header. A default TIdHTTP instance has IdHTTP.Request.Accept initialized to 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' (this may vary depending on the Indy version). Therefore, by default Tika will return HTML-formatted text. To get plain text, change it to 'text/plain; charset=utf-8'.
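Before changing the Delphi code, it can help to confirm the request shape with curl. The following is a minimal sketch, assuming the Tika server from the question is listening at http://localhost:9998/tika. The names tika_content_type, tika_text and the TIKA_URL variable are hypothetical, introduced here only for illustration, and the application/octet-stream fallback is an assumption rather than something taken from Tika's documentation.

```shell
# Hypothetical helper: map a file name to the Content-Type values used in this thread.
# The match is case-sensitive; the octet-stream fallback is an assumption.
tika_content_type() {
  case "$1" in
    *.pdf)  echo "application/pdf" ;;
    *.docx) echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
    *)      echo "application/octet-stream" ;;
  esac
}

# Hypothetical wrapper: PUT the raw file bytes (no multipart envelope)
# and ask for plain text via the Accept header.
tika_text() {
  curl -sS -T "$1" "${TIKA_URL:-http://localhost:9998/tika}" \
    --header "Content-Type: $(tika_content_type "$1")" \
    --header "Accept: text/plain"
}

# Usage (requires a running Tika server):
#   tika_text report.docx
```

If this returns the expected text for both a PDF and a DOCX file, the Delphi code only needs to reproduce the same three ingredients: the PUT verb, the raw file body, and the two headers.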

    Fixed implementation:

    uses IdGlobal, IdHTTP;
    
    function GetDocumentText(const FileName, ContentType: string): string;
    var
      IdHTTP: TIdHTTP;
      Stream: TIdReadFileExclusiveStream;
    begin
      IdHTTP := TIdHTTP.Create;
      try
        IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
        IdHTTP.Request.ContentType := ContentType;
        Stream := TIdReadFileExclusiveStream.Create(FileName);
        try
          Result := IdHTTP.Put('http://localhost:9998/tika', Stream);
        finally
          Stream.Free;
        end;
      finally
        IdHTTP.Free;
      end;
    end;
    
    function GetPDFText(const FileName: string): string;
    const
      PDFContentType = 'application/pdf';
    begin
      Result := GetDocumentText(FileName, PDFContentType);
    end;
    
    function GetDOCXText(const FileName: string): string;
    const
      DOCXContentType = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
    begin
      Result := GetDocumentText(FileName, DOCXContentType);
    end;
    

    According to Tika's documentation, it also supports posting multipart form data. If you insist on using this approach, change the target resource to /tika/form and switch to the Post method in your implementation:

    uses IdHTTP, IdMultipartFormData;

    function GetDocumentText(const FileName, ContentType: string): string;
    var
      IdHTTP: TIdHTTP;
      FormData: TIdMultiPartFormDataStream;
    begin
      IdHTTP := TIdHTTP.Create;
      try
        IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
        FormData := TIdMultiPartFormDataStream.Create;
        try
          FormData.AddFile('file', FileName, ContentType); { older Indy versions: FormData.Add(...) }
          Result := IdHTTP.Post('http://localhost:9998/tika/form', FormData);
        finally
          FormData.Free;
        end;
      finally
        IdHTTP.Free;
      end;
    end;
    

    Why does the original implementation in the question work with PDF files?

    When you Post multipart form data via TIdHTTP, Indy automatically sets the content type of the request to 'multipart/form-data; boundary=...whatever...'. This does not happen when you Put data (unless you set the content type manually before performing the request), so TIdHTTP.Request.ContentType remains blank. I can only guess that when Tika sees an empty content type, it falls back to some default type, which could be PDF, and is still somehow able to read the document from the multipart request.