
422 error with Microsoft Documents using Tika API and Python


I get a 422 error when I send Microsoft Office documents (.docx, .xlsx, etc.) to the /tika endpoint of the Tika REST API from Python.

I have tried to resolve this by making sure that the correct content type is passed in the header and that the file is opened in binary mode before it is passed to the endpoint.

I expect to see the contents of the .docx file printed. This code works with .pdf and .txt files, but fails for every Microsoft format.

import mimetypes
from tkinter import Tk, filedialog

import requests


def tika(files, mimetype):
    url = 'https://[server_url]/tika'
    # Content-Type is set by hand to the guessed type of the selected file.
    headers = {'Content-Type': mimetype, 'Cache-Control': 'no-cache'}
    # files= makes requests encode the body as multipart/form-data.
    r = requests.put(url, files=files, headers=headers)
    return r


if __name__ == "__main__":
    root = Tk()
    root.filename = filedialog.askopenfilename(parent=root, initialdir="/", title='Please select a file to scan')

    print('Parsing file:', root.filename)
    mimetype = mimetypes.MimeTypes().guess_type(root.filename)[0]
    print(mimetype)

    # Open in binary mode and close the file once the request has been sent.
    with open(root.filename, 'rb') as fin:
        r = tika({'files': fin}, mimetype)

    print(r.content)
    print(r.status_code)

Solution

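  • I had to use the /tika/form endpoint and leave Content-Type out of my headers to get this to work. When a file is passed through the files argument, the Python requests library sends it as a multipart form, so the document type I was declaring by hand did not match the body that was actually sent.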
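The fix described above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the answer: the server address `http://localhost:9998` and the helper names `build_form_request` and `extract_text` are assumptions made for the example, and POST is used on the assumption that /tika/form behaves as in the `curl -F` example from the Tika server documentation. The key point is that no Content-Type is passed, so requests generates the multipart header, including the boundary, on its own.

```python
import os

import requests

# Assumption: a Tika server reachable at this address; substitute your own.
TIKA_URL = "http://localhost:9998"


def build_form_request(fileobj, filename, base_url=TIKA_URL):
    """Prepare (but do not send) a multipart upload to /tika/form."""
    request = requests.Request(
        "POST",
        f"{base_url}/tika/form",
        # files= makes requests encode the body as multipart/form-data.
        # No Content-Type header is supplied here, so requests fills in
        # "multipart/form-data; boundary=..." to match the body.
        files={"files": (filename, fileobj)},
        headers={"Cache-Control": "no-cache"},
    )
    return request.prepare()


def extract_text(path, base_url=TIKA_URL):
    """Send the file at `path` to Tika and return the response body as text."""
    with open(path, "rb") as fin:
        prepared = build_form_request(fin, os.path.basename(path), base_url)
    # The file content was read into the request body during prepare().
    with requests.Session() as session:
        response = session.send(prepared)
    response.raise_for_status()
    return response.text
```

With a server running, `print(extract_text("report.docx"))` would print whatever Tika extracts from the file (`report.docx` is a placeholder name). Inspecting `build_form_request(...).headers` before sending is a quick way to confirm that the Content-Type really is multipart rather than the document's own MIME type.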