I am trying to call Apache Tika via its REST API. I have been able to upload a PDF document and get back the document's text with curl:
curl -X PUT --data-binary @<filename>.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
I translated that to Indy like so:
function GetPDFText(const FileName: String): String;
var
  IdHTTP: TIdHTTP;
  Params: TIdMultiPartFormDataStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    Params := TIdMultiPartFormDataStream.Create;
    try
      Params.Add('file', FileName, 'application/pdf');
      Result := IdHTTP.PUT('http://localhost:9998/tika', Params);
    finally
      Params.Free;
    end;
  finally
    IdHTTP.Free;
  end;
end;
Now I want to upload a Word document (.docx). I assumed that all I would need to do is change the content type when I add my file to Params, but that doesn't seem to produce any results, although no error is reported back. I was able to get the following curl command to work correctly:
CURL -T <myDOCXfile>.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
How do I modify my HTTP call from CURL -X PUT to CURL -T?
There are at least two issues in your implementation:
1. The translation from curl -X PUT to TIdHTTP is wrong.
2. You did not specify the Accept HTTP header to retrieve the extracted text in a specific format.

How to translate curl -X PUT to Indy?

First, let's make it clear that curl -X PUT --data-binary @<filename> <url> is the same as curl -T <filename> <url> when:
- the scheme of <url> is HTTP or HTTPS
- <url> does not end with /

Therefore, using one or the other shouldn't matter in your case. See also the curl documentation.
Secondly, TIdMultiPartFormDataStream is designed for use with the POST verb. However, nothing stops you from passing it to TIdHTTP.Put, because it is indirectly derived from TStream. There is even a dedicated overload of the TIdHTTP.Post method that accepts TIdMultiPartFormDataStream:
function Post(AURL: string; ASource: TIdMultiPartFormDataStream): string; overload;
To upload a file to the service, just use the TIdHTTP.Put method with a TFileStream (or a similar file stream) as the argument, and provide the proper content type of the file being uploaded in the Content-Type HTTP header.
And finally, you are trying to extract plain text from the document, but you didn't specify the content type that the service should return. This is done via the Accept HTTP header. A default instance of TIdHTTP has the property IdHTTP.Request.Accept initialized to 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' (this may vary depending on the Indy version). Therefore, by default Tika returns HTML-formatted text. To get plain text, change it to 'text/plain; charset=utf-8'.
Fixed implementation:
uses IdGlobal, IdHTTP;

function GetDocumentText(const FileName, ContentType: string): string;
var
  IdHTTP: TIdHTTP;
  Stream: TIdReadFileExclusiveStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    // Ask Tika for plain text instead of the default HTML output
    IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
    // Tell Tika what kind of document is being uploaded
    IdHTTP.Request.ContentType := ContentType;
    Stream := TIdReadFileExclusiveStream.Create(FileName);
    try
      // Send the raw file bytes as the request body
      Result := IdHTTP.Put('http://localhost:9998/tika', Stream);
    finally
      Stream.Free;
    end;
  finally
    IdHTTP.Free;
  end;
end;

function GetPDFText(const FileName: string): string;
const
  PDFContentType = 'application/pdf';
begin
  Result := GetDocumentText(FileName, PDFContentType);
end;

function GetDOCXText(const FileName: string): string;
const
  DOCXContentType = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
begin
  Result := GetDocumentText(FileName, DOCXContentType);
end;
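If you would rather not write one wrapper per file type, the content type could be derived from the file extension. The following is only a minimal sketch: ContentTypeForFile is a hypothetical helper name (not part of Indy or Tika), and it covers only the two document types discussed here.

```pascal
program ContentTypeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: maps a file extension to the Content-Type value
// used for the upload. Only the two types from this answer are handled.
function ContentTypeForFile(const FileName: string): string;
var
  Ext: string;
begin
  // Compare case-insensitively so that 'report.PDF' also matches
  Ext := LowerCase(ExtractFileExt(FileName));
  if Ext = '.pdf' then
    Result := 'application/pdf'
  else if Ext = '.docx' then
    Result := 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  else
    raise Exception.CreateFmt('Unsupported file type: %s', [Ext]);
end;

begin
  Writeln(ContentTypeForFile('report.PDF'));
  Writeln(ContentTypeForFile('letter.docx'));
end.
```

The returned string could then be passed as the ContentType argument of GetDocumentText.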
According to Tika's documentation, it also supports posting multipart form data. If you insist on using this approach, you should change the target resource to /tika/form and switch to the Post method in your implementation:
uses IdHTTP, IdMultipartFormData;

function GetDocumentText(const FileName, ContentType: string): string;
var
  IdHTTP: TIdHTTP;
  FormData: TIdMultiPartFormDataStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    // Ask Tika for plain text instead of the default HTML output
    IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
    FormData := TIdMultiPartFormDataStream.Create;
    try
      FormData.AddFile('file', FileName, ContentType); { older Indy versions: FormData.Add(...) }
      Result := IdHTTP.Post('http://localhost:9998/tika/form', FormData);
    finally
      FormData.Free;
    end;
  finally
    IdHTTP.Free;
  end;
end;
When you Post multipart form data via TIdHTTP, Indy automatically sets the content type of the request to 'multipart/form-data; boundary=...whatever...'. This is not the case when you Put data (unless you set it manually before performing the request), and therefore TIdHTTP.Request.ContentType remains blank. I can only guess that when Tika sees an empty content type, it falls back to some default type, which could be PDF, and that it is still somehow able to read the document from the multipart request.
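To see the header value that Post would send for a multipart request, you can print the stream's RequestContentType property. This is a small diagnostic sketch, assuming (as I understand Indy's source) that TIdMultiPartFormDataStream exposes RequestContentType and that the Post overload copies it into Request.ContentType. The program name is made up for illustration.

```pascal
program ShowMultipartContentType;

{$APPTYPE CONSOLE}

uses
  IdMultipartFormData;

var
  FormData: TIdMultiPartFormDataStream;
begin
  FormData := TIdMultiPartFormDataStream.Create;
  try
    // Assumption: this is the value the multipart Post overload assigns to
    // TIdHTTP.Request.ContentType. A plain Put does not assign it.
    Writeln(FormData.RequestContentType);
  finally
    FormData.Free;
  end;
end.
```

The boundary part is generated per stream instance, so the printed value differs from run to run.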