I am using python3, urllib3 and tika-server-1.13 in order to get text from different types of files. This is my Python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config
    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type
    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }
    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True
This code works for HTML files, but when I try to extract text from docx files it doesn't work.
The server returns HTTP error code 422: Unprocessable Entity.
Following the tika-server documentation, I tried curl to check whether it works there:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
The tika-server docs say:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct MIME type (I also checked it with Tika's detect system), it's supported, and the file is not encrypted.
I believe this is related to how I upload the file to the Tika server. What am I doing wrong?
You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by overwriting headers['Content-Type'] you're preventing urllib3 from setting the multipart Content-Type header (which carries the boundary) on the request, so Tika can't understand the body. Either stop updating headers['Content-Type'], or simply pass body=input_file.read() instead of fields.
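As a rough sketch of the second option, the upload could look something like the following. The helper name put_raw_to_tika is hypothetical, and pool is assumed to be a urllib3 connection pool pointed at the Tika server (like self.pool in the question); retries and response handling are left out.

```python
def put_raw_to_tika(pool, input_file_path, mime_type):
    """Send the file bytes unmodified as the request body.

    Hypothetical helper: `pool` is assumed to be a urllib3 pool whose
    request() accepts body= and headers= keyword arguments.
    """
    headers = {}
    if mime_type:
        # Safe to set here: there is no multipart header to clobber,
        # because no `fields` argument is passed.
        headers['Content-Type'] = mime_type
    with open(input_file_path, 'rb') as input_file:
        body = input_file.read()
    # body= sends the raw bytes, the equivalent of curl --data-binary
    return pool.request('PUT', '/tika', body=body, headers=headers)
```

With this shape, the Content-Type header describes the document itself, which matches what the curl command sends.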