python, python-2.7, pypdf

Extract first two lines of PDF with Python and pyPDF


I'm using Python 2.7 and pyPdf to get the title metadata from PDF files. Unfortunately, not all of the PDFs have that metadata. What I want to do now is grab the first two lines of text from a PDF. How can I modify the code I have now to capture the first two lines with pyPdf?

from pyPdf import PdfFileReader
import os

for fileName in os.listdir('.'):
    # skip anything that is not a PDF
    if not fileName.lower().endswith(".pdf"):
        continue
    try:
        input1 = PdfFileReader(file(fileName, "rb"))

        # print the file name and the title from the document metadata
        print fileName, input1.getDocumentInfo().title
    except Exception:
        # unreadable PDF or missing metadata
        print ",",

Solution

  • from PyPDF2 import PdfFileReader
    import StringIO

    fileName = "HMM.pdf"
    try:
        if fileName.lower().endswith(".pdf"):
            input1 = PdfFileReader(file(fileName, "rb"))

            # extract the text of the first page
            content = input1.getPage(0).extractText()

            # read the first two lines from the extracted text
            buf = StringIO.StringIO(content)
            firstLine = buf.readline().rstrip()
            secondLine = buf.readline().rstrip()
            print firstLine
            print secondLine

    except Exception:
        # unreadable PDF or a page with no extractable text
        print ",",
    

    "HMM.pdf" is in my current working directory, and this code works correctly on Python 2.7.
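
Text extracted from a PDF often starts with blank lines or carries stray whitespace, so reading two raw lines may not give two useful ones. Below is a minimal sketch of a helper that skips blank lines and falls back to the page text only when the metadata title is missing. The names `first_lines` and `title_or_first_lines` are hypothetical, not part of pyPdf or PyPDF2, and the sketch assumes the extracted text actually contains line breaks, which depends on the PDF. It uses only plain string operations, so it should run on Python 2.7 and Python 3 alike.

```python
def first_lines(text, count=2):
    """Return up to `count` non-blank lines of `text`, stripped of whitespace."""
    lines = []
    for line in (text or "").splitlines():
        line = line.strip()
        if line:
            lines.append(line)
            if len(lines) == count:
                break
    return lines


def title_or_first_lines(meta_title, page_text):
    """Prefer the metadata title; otherwise join the first two text lines."""
    if meta_title and meta_title.strip():
        return meta_title.strip()
    return " ".join(first_lines(page_text, 2))
```

In the loop from the question, `page_text` would be `input1.getPage(0).extractText()`, and `meta_title` would be the title from `input1.getDocumentInfo()`. That call may return `None` for files without an info dictionary, so it is worth checking before reading `.title`.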