python  python-2.7  pdf  pypdf  pdfminer

Read pdf page by page


I searched for my question but did not find an answer in the two existing questions:

  1. Extract text per page with Python pdfMiner?

  2. PDFMiner - Iterating through pages and converting them to text

Basically, I want to iterate over the pages one at a time, because I want to select only the pages that contain a certain text.

I have used pyPdf. It works for roughly 90% of the PDFs, but sometimes it does not extract the text from a page.

I have used the below code:

import re
import pyPdf

extract = ""
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
  ex = pdf.getPage(p)  # fetch the current page, not a fixed index
  ex = ex.extractText()
  if re.search(r"to be held (at|on)", ex.lower()):
    print 'yes'
    print ex, "\n"
    extract = extract + ex + "\n"

The above code works, but for some PDFs certain pages are not extracted.

I also tried pdfminer, but I could not find out how to iterate over the PDF page by page with it. pdfminer returns the text of the entire PDF.

I used the below code:

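from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  fp = file(path, 'rb')
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  pagenos = set()  # an empty set means "all pages"

  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
    interpreter.process_page(page)
    # retstr accumulates the text of every page processed so far
    text = retstr.getvalue()

  fp.close()
  device.close()
  retstr.close()
  return text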

In the above code, the text from the PDF comes from the for loop:

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

How can I iterate over one page at a time here?

The pdfminer documentation is hard to follow, and there are several versions of the library.

So, are there any other packages that can do this, or can pdfminer be used for it?


Solution

  • I know it is not good to answer your own question, but I think I may have figured out an answer.

    It is probably not the best way to do it, but it works for me.

    I used a combination of pyPdf and pdfminer.

    The code is as below:

    import re
    import pyPdf
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    path = "filename.pdf"
    # pyPdf is used only to count the pages
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    fp = file(path, 'rb')
    num_of_pages = pdf.getNumPages()
    extract = ""
    for i in range(num_of_pages):
      # restrict pdfminer to page i and give it a fresh output buffer
      pagenos = set([i])
      rsrcmgr = PDFResourceManager()
      retstr = StringIO()
      codec = 'utf-8'
      laparams = LAParams()
      device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
      interpreter = PDFPageInterpreter(rsrcmgr, device)
      password = ""
      maxpages = 0
      caching = True
      text = ""
      for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
        text = retstr.getvalue()
        text = text.decode("ascii", "replace")
        if re.search(r"to be held (at|on)", text.lower()):
          print text
          extract = extract + text + "\n"
      device.close()
      retstr.close()
    fp.close()
    

    There may be a better way to do it, but so far I have found this to work well.
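
For what it's worth, pdfminer on its own may be enough here: giving each page its own output buffer keeps the text of each page separate, so pyPdf is not needed just to count pages. Below is a minimal sketch of that idea, assuming Python 3 and the pdfminer.six fork rather than the Python 2 pdfminer used above. `iter_page_texts` and `select_pages` are hypothetical helper names, not part of either library, and I have not tried this on the PDFs from the question.

```python
import io
import re


def iter_page_texts(path):
    """Yield (page_number, text) for each page of the PDF at `path`.

    Assumes pdfminer.six is installed. It is imported inside the function
    so that select_pages() below can be used without it.
    """
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    with open(path, "rb") as fp:
        for number, page in enumerate(PDFPage.get_pages(fp)):
            # A fresh buffer per page keeps each page's text separate.
            buf = io.StringIO()
            device = TextConverter(rsrcmgr, buf, laparams=LAParams())
            PDFPageInterpreter(rsrcmgr, device).process_page(page)
            device.close()
            yield number, buf.getvalue()


def select_pages(page_texts, pattern=r"to be held (at|on)"):
    """Return the (page_number, text) pairs whose text matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(n, t) for n, t in page_texts if regex.search(t)]
```

Something like `select_pages(iter_page_texts("filename.pdf"))` would then return only the matching pages, with page numbers counted from 0. Since `select_pages` only needs (number, text) pairs, it can also be fed text extracted by pyPdf or any other library.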