
PyPDF hangs on big drawing


Here is the PDF I'm trying to parse. My code (below) hangs on the following line:

 content += " ".join(extract.strip().split())

It hangs on page 21, which is a big drawing. I wouldn't mind just skipping pages like this big drawing, but I'm not sure how to do this. Can anybody help me out?

 def ConvertPDFToText(self, pPDF):
    content = ""
    # Load PDF into pyPDF
    remoteFile = urlopen(pPDF).read()
    memoryFile = StringIO(remoteFile)
    pdf = PdfFileReader(memoryFile)
    print("Done reading")
    # Iterate pages
    try:
        numPages = pdf.getNumPages()
        print(str(numPages) + " pages detected")
        for i in range(0, numPages):
            # Extract text from page and add to content
            page = pdf.getPage(i)
            extract = page.extractText() + "\n"
            content += " ".join(extract.strip().split())

    except UnicodeDecodeError as ex:
        print(self._Name + " - Unicode Error extracting pages: " + str(ex))
        return ""
    except Exception as ex:
        print(self._Name + " - Generic Error extracting pages - " + str(ex))
        return ""
    # Decode the content. Since we don't know the encoding, we iterate through some possibilities.

    encodings = ['utf8', 'windows-1250', 'windows-1252', 'utf16', 'utf32']
    DecodedContent = ""
    for code in encodings:
        try:
            DecodedContent = content.decode(code)
            break
        except Exception as ex:
            continue
    return DecodedContent

Solution

  • Rather than using pyPdf, which hasn't been updated since 2010, you should use PyPDF2, the newer fork of pyPdf. It is available on PyPI under the name PyPDF2.

    I just used it on your example PDF and it worked fine, although it took a while to parse that file. Here's the code I used:

    from PyPDF2 import PdfFileReader

    #----------------------------------------------------------------------
    def parse_pdf(pdf_file):
        """Return the text of every page in pdf_file, one line per page."""
        content = ""
        # Keep the file open while extracting; pages are read on demand
        with open(pdf_file, 'rb') as handle:
            pdf = PdfFileReader(handle)
            numPages = pdf.getNumPages()
            for i in range(0, numPages):
                # Extract text from page and add to content
                page = pdf.getPage(i)
                extract = page.extractText()
                # Collapse runs of whitespace, then end the page with a newline
                content += " ".join(extract.split()) + "\n"
        return content

    if __name__ == "__main__":
        pdf = "Kicking Horse Mountain Park Construction 2014.pdf"
        print(parse_pdf(pdf))
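
If you would still rather skip slow pages such as the big drawing, one option is to put a time limit on each page. The following is only a sketch, not something I ran against your PDF: it assumes a Unix system (it relies on SIGALRM), must run in the main thread, and the names `time_limit`, `PageTimeout` and `extract_pages` are made up for this example. It works here because the extraction is pure Python, so the signal handler gets a chance to interrupt it.

```python
import signal
from contextlib import contextmanager


class PageTimeout(Exception):
    """Raised when a single page takes too long to process."""


@contextmanager
def time_limit(seconds):
    """Raise PageTimeout if the enclosed block runs longer than `seconds`.

    Assumption: Unix only (uses SIGALRM) and called from the main thread.
    """
    def _on_alarm(signum, frame):
        raise PageTimeout()

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def extract_pages(extractors, seconds_per_page=10):
    """Call each zero-argument extractor, skipping any that time out.

    Returns (texts, skipped), where skipped holds the 0-based page indexes
    of the pages that exceeded the time limit.
    """
    texts, skipped = [], []
    for index, extract in enumerate(extractors):
        try:
            with time_limit(seconds_per_page):
                # Same whitespace clean-up as in the code above
                texts.append(" ".join(extract().split()))
        except PageTimeout:
            skipped.append(index)
    return texts, skipped
```

With the reader from the code above, the list of extractors could be built as `[pdf.getPage(i).extractText for i in range(pdf.getNumPages())]` and passed to `extract_pages`. The skipped indexes tell you which pages were left out, so you can log them or retry them later with a longer limit.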