I'm trying to extract a certain portion of text from a pdf file. I've used PyPDF2
library to do that. However, when i excecute the script below I can see that the content I wish to grab is being printed in the console awkwardly.
I've written so far:
import io
import PyPDF2
import requests
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(0).extractText()
print(contents)
Output I'm having:
ACCESSHEALTHCTConnecticutAllPayersClaimsDatabaseDATASUBMISSIONGUIDE
December5,2013
Version1.2(withclarifications)
Output I wish to grab like:
ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)
I would suggest PDFMiner if installing other packages causes a dependency issue.
You can install it for python 3.7 by doing pip install pdfminer.six
, I've already tested and its working on my python 3.7.
The code for getting page 0 is as follows
import io
import requests
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
fp = io.BytesIO(res.content)
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
page_no = 0
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
if pageNumber == page_no:
interpreter.process_page(page)
data = retstr.getvalue()
print(data.strip())
Outputs
ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)
The good thing about PDFMiner is that it reads your pages directly and it focuses entirely on getting and analyzing text data.