python, python-3.x, web-scraping, pypdf

Can't make my script print output in the desired format


I'm trying to extract a certain portion of text from a PDF file using the PyPDF2 library. However, when I execute the script below, the content I want to grab is printed to the console with the spaces and line breaks missing.

Here is what I've written so far:

import io
import PyPDF2
import requests

URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(0).extractText()
print(contents)

Output I'm getting:

ACCESSHEALTHCTConnecticutAllPayersClaimsDatabaseDATASUBMISSIONGUIDE
December5,2013
Version1.2(withclarifications)

Output I'd like to get:

ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)

Solution

  • I would suggest PDFMiner, especially if installing other packages causes dependency issues.

    You can install it for Python 3.7 with pip install pdfminer.six. I've tested it and it works on my Python 3.7.

    The code for getting page 0 is as follows:

    import io
    import requests
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

    # Download the PDF and wrap the bytes in a file-like object
    res = requests.get(URL)
    fp = io.BytesIO(res.content)

    # The text converter writes the extracted text into retstr
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    page_no = 0  # zero-based index of the page to extract
    for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
        if pageNumber == page_no:
            interpreter.process_page(page)
            break  # no need to walk through the remaining pages

    # Read the buffer after the loop so data is always defined
    data = retstr.getvalue()
    print(data.strip())
    

    Output:

    ACCESS HEALTH CT 
    
    Connecticut All Payers Claims Database 
    
    DATA SUBMISSION GUIDE 
    
    December 5, 2013 
    
    Version 1.2 (with clarifications) 
    

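    The good thing about PDFMiner is that it focuses entirely on extracting and analyzing text data. Its layout analysis (configured through LAParams) infers word spacing and line breaks from the page, which is why the text above comes out properly separated.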
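
If you need exactly the format from the question (no blank lines between the lines and no trailing spaces), a small post-processing step on the extracted string should be enough. This is only a minimal sketch: tidy_lines is a hypothetical helper name, and the raw string below just imitates the PDFMiner output shown above rather than coming from a fresh run against the PDF.

```python
def tidy_lines(text):
    """Strip whitespace around each line and drop lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Sample input imitating the extracted text above (an assumption for
# illustration; in the real script you would pass `data` instead).
raw = (
    "ACCESS HEALTH CT \n\n"
    "Connecticut All Payers Claims Database \n\n"
    "DATA SUBMISSION GUIDE \n\n"
    "December 5, 2013 \n\n"
    "Version 1.2 (with clarifications) \n"
)

print(tidy_lines(raw))
```

In the script above you would call print(tidy_lines(data)) instead of print(data.strip()). Note that this also removes blank lines that separate real paragraphs, so it suits short title pages like this one better than running body text.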