python-3.x dropbox-api apache-tika

Downloading file from Dropbox API for use in Python Environment with Apache Tika on Heroku


I'm trying to use Dropbox as a cloud-based file receptacle for an app/script. The script, written in Python, needs to take PDFs from Dropbox and use the tika-python wrapper to convert them to strings.

I'm able to connect to the Dropbox API and use the files_download_to_file() method to download the PDFs to disk, and then use tika's from_file() method to read the downloaded file from disk and process it. Example:

from tika import parser

# Download ex.pdf to local disk
local_path = '/my_local_path/ex_on_disk.pdf'
dbx.files_download_to_file(local_path, '/my_dropbox_path/ex.pdf')

# Parse the downloaded copy from the same local path
parsed = parser.from_file(local_path)

The problem is that I'm planning on running this app on something like Heroku, where I don't think I'm able to save anything locally and then access it again. I'm not sure how to get something from the Dropbox API that the tika wrapper can reference directly, so that it runs the same way as above. I think the PHP SDK has file_get_contents and file_put_contents methods, but there doesn't appear to be an equivalent in the Python SDK.

I've tried using the shareable links in place of a filename, but that hasn't worked. Any ideas? I know there's also the files_download method, which downloads the FileMetadata object, but I have no idea what to do with it and am having trouble finding more information about it.

TL;DR: How can I reference a file on Dropbox with a filename string such as 'example.pdf', for use in another function that expects to read a file from disk, without saving that Dropbox file to disk?


Solution

  • I figured it out. I used the files_download method to get the file's contents as a byte string, and then used tika's from_buffer method instead of from_file:

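    # files_download returns a (FileMetadata, requests.Response) tuple
    md, response = dbx.files_download(path)
    file_contents = response.content  # raw bytes of the PDF

    # Parse the bytes directly instead of reading a file from disk
    parsed = parser.from_buffer(file_contents)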
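
For reuse, the two steps above could be wrapped in a small helper. This is only a sketch: the function name pdf_text_from_dropbox and its from_buffer argument are hypothetical, not part of either SDK. It assumes that dbx behaves like a dropbox.Dropbox client, that the callable passed in behaves like tika's parser.from_buffer, and that the result is a dict whose 'content' entry holds the extracted text (or None when nothing was extracted).

```python
def pdf_text_from_dropbox(dbx, dropbox_path, from_buffer):
    """Return the extracted text of a Dropbox file without writing it to disk.

    dbx          -- client exposing files_download(path), e.g. dropbox.Dropbox
    dropbox_path -- path inside Dropbox, e.g. '/my_dropbox_path/ex.pdf'
    from_buffer  -- callable that parses bytes, e.g. tika.parser.from_buffer
    (This helper and its parameter names are illustrative assumptions.)
    """
    # files_download returns a (metadata, response) pair; the file's raw
    # bytes are available as response.content.
    metadata, response = dbx.files_download(dropbox_path)
    try:
        parsed = from_buffer(response.content)
    finally:
        # Release the HTTP connection if the response object supports it.
        close = getattr(response, "close", None)
        if callable(close):
            close()
    # Assumed result shape: a dict whose 'content' may be None.
    return (parsed.get("content") or "").strip()
```

With the real libraries this would be called as pdf_text_from_dropbox(dbx, '/my_dropbox_path/ex.pdf', parser.from_buffer), where dbx and parser are the same objects used in the snippets above. Passing the parser in as an argument keeps the helper easy to test with stand-in objects.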