python, apache-tika, tika-server

Python Tika cannot parse pdf from url


I am using Python to parse an online PDF for later use. My code is below.

from tika import parser
import requests
import io
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
    pdfFile = parser.from_file(open_pdf_file)
print(pdfFile)

However, it fails with:

AttributeError: '_io.BytesIO' object has no attribute 'decode'

I took the example from the question "How can i read a PDF file from inline raw_bytes (not from file)?"

That example uses PyPDF2, but I need Tika because it gives better results than PyPDF2.

Thank you for helping


Solution

  • To use tika you need Java 8 installed. parser.from_file accepts a URL string directly, so there is no need to download the PDF into a BytesIO object first. The code to retrieve the PDF and print its content is as follows:

    from tika import parser
    
    url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
    
    pdfFile = parser.from_file(url)
    
    print(pdfFile["content"])
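
  • If the PDF bytes are already in memory (for example, downloaded with requests as in the question), tika's parser.from_buffer may be a closer fit than from_file, since it takes the raw data rather than a path or URL. The following is only a minimal sketch, assuming a working Java and Tika setup; the helper name extract_pdf_text and its parse_buffer parameter are hypothetical and introduced purely for illustration:

```python
def extract_pdf_text(pdf_bytes, parse_buffer=None):
    """Return the text Tika extracts from raw PDF bytes ('' if there is none).

    extract_pdf_text and parse_buffer are hypothetical names for this sketch.
    parse_buffer defaults to tika's parser.from_buffer.
    """
    if parse_buffer is None:
        # Imported lazily so the helper can be exercised without a Tika server.
        from tika import parser
        parse_buffer = parser.from_buffer

    parsed = parse_buffer(pdf_bytes)
    # The parsed result is a dict; "content" may be None when no text is found.
    return (parsed.get("content") or "").strip()


if __name__ == "__main__":
    import requests

    url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    print(extract_pdf_text(response.content))
```

    Passing response.content (bytes) instead of an io.BytesIO wrapper avoids handing from_file an object it tries to treat as a path or URL string, which is the likely source of the 'decode' error in the question.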