I'm developing a website with the Python Flask framework that handles PDFs. I store the PDF files in MongoDB, which works fine when I need to serve them to visiting users. I now need to do some text and image extraction, for which I use the pdfminer library. When I use pdf2txt.py and provide the file from the file system, this line (context here) works pretty much instantly:
    from pdfminer.pdfpage import PDFPage

    for page in PDFPage.get_pages(file('ticket.pdf', 'rb'), pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        pass
but when I edit the code so that I provide the GridFS object from my MongoDB instead, the second line (i.e. after the retrieval has finished) takes about 8 seconds to complete (the result is identical to the code above):
    document = UserDocument.objects.first()
    for page in PDFPage.get_pages(document.file_, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        pass
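In case it helps with diagnosing, here is a minimal profiling sketch (my own, assuming the same UserDocument model as above) that shows where the time goes:

    import cProfile

    from pdfminer.pdfpage import PDFPage

    def extract(fp):
        # Iterate all pages without doing anything, to isolate the parsing cost.
        for page in PDFPage.get_pages(fp):
            pass

    document = UserDocument.objects.first()
    cProfile.run('extract(document.file_)', sort='cumulative')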
This kind of surprises me, because I assumed that a file taken from MongoDB and a file taken from the file system would behave the same (it renders identically in the browser), but apparently they do not.
Does anybody know what the difference between the two is that causes this call to take so long, and more importantly, how I can solve it? All tips are welcome!
To answer my own question: it turns out that strings are immutable in Python, which means any string manipulation creates a new string; this can get out of hand if you have multi-megabyte strings (i.e. repeatedly copying the "remainder" of a string you are processing into a new string would exhibit exactly this kind of slowdown).
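A toy illustration of the pattern (my sketch, not pdfminer's actual code): consuming a large string by re-slicing it copies the ever-shrinking remainder on every step, which is O(n^2) overall, whereas tracking an offset copies each chunk only once:

    # Hypothetical illustration of the pattern, not pdfminer's actual code.
    data = 'x' * (4 * 1024 * 1024)  # stand-in for a 4 MB PDF loaded as one string

    def consume_by_slicing(buf, chunk=4096):
        # Copies the remainder of the string on every iteration: O(n^2).
        while buf:
            head, buf = buf[:chunk], buf[chunk:]

    def consume_by_offset(buf, chunk=4096):
        # Only copies each chunk once: O(n).
        pos = 0
        while pos < len(buf):
            head = buf[pos:pos + chunk]
            pos += chunk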
Apparently this highlights the fact that the pdfminer library is badly written in this respect. So I have two options:

1. Fix the pdfminer library so it stops copying huge strings around.
2. Read the whole GridFS file into an in-memory string buffer (StringIO) first and hand that to pdfminer.
Although option 1 would be the best option, I have neither the know-how of this library nor the time to acquire it. So I opted for option 2, using the string buffer:
    from StringIO import StringIO

    document = UserDocument.objects.first()
    fp = StringIO()
    fp.write(document.file_.read())  # Also takes about 0.8 sec, but that's still faster than 8 seconds.
    fp.seek(0)  # Rewind so pdfminer reads from the start of the buffer.
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        pass
This now takes about 1 second, which, although still slow, is workable for now. Later in the development process we'll see whether we can fork and improve the pdfminer library.
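If the extra ~0.8 second copy ever becomes a problem, a slightly leaner variant (my sketch, assuming Python 2, where cStringIO is a faster drop-in for StringIO) builds the buffer in a single step:

    from cStringIO import StringIO

    document = UserDocument.objects.first()
    fp = StringIO(document.file_.read())  # one read into a single in-memory buffer
    for page in PDFPage.get_pages(fp):
        pass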