I am trying to extract text from PDF files using Python (v3.8.2) and the PyPDF2 module (v1.26.0). Everything works except for PDF files generated with Chrome's print option.
Over time I have saved a number of pages and documents as PDFs using Chrome's print dialog, which has an option to save as PDF. I cannot extract text from these files: the code returns only ' ' (empty), while other PDF files work without problems. To reproduce this, save any web page as a PDF from Chrome's print dialog and run the code on that file. I am using Chrome v81.0.4044.138.
I found that Chrome uses Skia to save pages as PDF (PDF Producer: Skia/PDF m80), but that did not help me solve the problem.
I found the following similar question on Stack Overflow, but nobody has answered it yet, and as a new user I cannot comment on it, hence this new question:
Extract text from pdf converted from webpage using Pypdf2
This is the code I am using:
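import PyPDF2

# Open in binary mode; the with block closes the file even if extraction fails
with open('example.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)  # first page (zero-based index)
    print(pageObj.extractText())  # prints ' ' for the Chrome-generated PDFs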
I am a new user and this is my first question, so please correct me if I have done anything wrong. I have searched Google, but either found no solution or lacked the knowledge to understand the problem and the solutions I did find. Thank you.
PyPDF2 is highly unreliable for extracting text from PDFs, as is also pointed out here, which says:
While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.
See my answer to a similar question here.
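As a rough sketch of the PDFMiner route mentioned above, something like the following may be worth trying. It assumes the maintained fork pdfminer.six is installed (for example via pip install pdfminer.six) and uses its high-level extract_text function. The helper names has_text and extract_with_pdfminer are made up for illustration, and example.pdf is just the placeholder file name from the question. I have not verified that this recovers text from every Chrome-generated PDF.

```python
def has_text(extracted):
    """Return True if the extracted string contains any non-whitespace text."""
    return bool(extracted and extracted.strip())


def extract_with_pdfminer(path):
    """Extract all text from the PDF at `path` using pdfminer.six."""
    # Imported lazily so has_text() can be used without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


if __name__ == "__main__":
    import os

    pdf_path = "example.pdf"  # placeholder file name taken from the question
    if os.path.exists(pdf_path):
        text = extract_with_pdfminer(pdf_path)
        print(text if has_text(text) else "No text could be extracted.")
```

The has_text check is there because an extractor that fails silently tends to return an empty or whitespace-only string rather than raise an error, which is the symptom described in the question.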