
Python - urllib3 get text from docx using tika server


I am using Python 3, urllib3 and tika-server-1.13 to extract text from different types of files. This is my Python code:

def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub(r"\[[^\]]+\]", "", data)
            final_text = re.sub(r"(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True

This code works for HTML files, but when I try to extract text from docx files it doesn't work.

The server returns HTTP error code 422: Unprocessable Entity.

Following the tika-server documentation, I tried the same upload with curl to check whether it works:

curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

and it worked.

The tika-server docs describe this status as:

422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc

This is the correct mime-type (I also checked it with Tika's detect system), it is supported, and the file is not encrypted.

I believe this is related to how I upload the file to the Tika server. What am I doing wrong?


Solution

  • You're not uploading the data the same way. --data-binary in curl uploads the binary data as it is, with no encoding. In urllib3, passing fields makes urllib3 encode the request body as multipart/form-data, and it would normally set a matching Content-Type header that includes the multipart boundary. By setting headers['Content-Type'] yourself, you override that header, so Tika is told the body is a docx file while it actually receives a multipart envelope. Either stop updating headers['Content-Type'], or simply pass the raw bytes with body=input_file.read() (read inside the with block) instead of fields.
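
The second option can be sketched roughly as follows. This is a minimal illustration, not the asker's final code: put_raw_file and DOCX_MIME are hypothetical names, and pool is assumed to be a urllib3 connection pool like self.pool in the question. The file bytes go into body, so no multipart encoding takes place and the Content-Type header describes the body accurately, which mirrors what curl --data-binary does.

```python
# Hypothetical helper: sends the file as the raw request body.
DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)


def put_raw_file(pool, input_file_path, mime_type, url="/tika"):
    """PUT the file's bytes unmodified, like curl --data-binary."""
    # Read the bytes while the file is still open.
    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()

    headers = {}
    if mime_type:
        # Safe to set here: the body really is this type, not multipart.
        headers["Content-Type"] = mime_type

    # body= sends the payload as-is; fields= is deliberately not used.
    return pool.request("PUT", url, body=payload, headers=headers)


# Possible usage, assuming a Tika server on localhost:9998:
#   import urllib3
#   pool = urllib3.HTTPConnectionPool("localhost", 9998)
#   response = put_raw_file(pool, "test.docx", DOCX_MIME)
#   print(response.status)
```

The retry loop and the regex cleanup from the question can stay as they are; only the way the request body is built changes.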