I was trying to get a PDF from a webpage, parse it and print the result to the screen using PyPDF2. I got it working without issues with the following code:
with open("foo.pdf", "wb") as f:
f.write(requests.get(buildurl(jornal, date, page)).content)
pdfFileObj = open('foo.pdf', "rb")
pdf_reader = PyPDF2.PdfFileReader(pdfFileObj)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())
Writing a file just so I can then read it though sounded wasteful, so I figured I'd just cut the middleman with this:
pdf_reader = PyPDF2.PdfFileReader(requests.get(buildurl(jornal, date, page)).content)
page_obj = pdf_reader.getPage(0)
print(page_obj.extractText())
This, however yields me an AttributeError: 'bytes' object has no attribute 'seek'
. How can I feed the PDF coming from requests
directly onto PyPDF2?
You have to convert the returned content
to a file-like object using BytesIO
:
import io
pdf_content = io.BytesIO(requests.get(buildurl(jornal, date, page)).content)
pdf_reader = PyPDF2.PdfFileReader(pdf_content)