Tags: python, pdf, python-tesseract

Why is my code only creating a jpeg from the last page of the PDF and therefore only writing the last page to a text file?


I need to scrape a huge amount of text from a PDF for certain keywords, then list those keywords along with the pages they are found on. I'm admittedly very new to Python and am starting out by simply following a tutorial that converts each PDF page to a JPEG and writes the recognized text to a text file. However, I'm running into problems even with this. Although I do seem to be able to turn some of this PDF into text, it is only taking one page, the last page. Why is that, and how do I fix it?

Thanks

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

PDF_file = "file2.pdf"

pages = convert_from_path(PDF_file, 500)

image_counter = 1

for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

filelimit = image_counter-1

outfile = "out_text.txt"

f = open(outfile, "a")

for i in range(1, filelimit + 1):
    text = str(((pytesseract.image_to_string(Image.open(filename)))))
    text = text.replace('-\n', '')
    f.write(text)

f.close()

Solution

  • The problem is in the filename declaration.

    When the first loop finishes:

    for page in pages: 
        filename = "page_"+str(image_counter)+".jpg"
        page.save(filename, 'JPEG') 
        image_counter = image_counter + 1
    

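    Your filename variable is still set to the name built from the final image_counter value, i.e. the last page. The second loop never updates it, so Image.open(filename) opens that same last image on every pass, and the last page's text is written filelimit times. The JPEGs for the other pages are created; they are just never read.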
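The effect can be reproduced without any PDF or OCR library. The sketch below keeps the same loop structure but uses a plain list of strings as a stand-in for pages (an assumption made only for illustration) and records which filename each pass of the second loop would open.

```python
# Stand-in for the list returned by convert_from_path (illustration only).
pages = ["first", "second", "third"]

image_counter = 1
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    image_counter = image_counter + 1

filelimit = image_counter - 1

# The first loop is over, so filename keeps the last value it was given.
opened = []
for i in range(1, filelimit + 1):
    opened.append(filename)  # i changes on each pass, filename does not

print(opened)  # ['page_3.jpg', 'page_3.jpg', 'page_3.jpg']
```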

    One solution is to rebuild filename inside the second loop, using the loop index i.

    for i in range(1, filelimit + 1): 
        filename = "page_"+str(i)+".jpg"
        text = str(((pytesseract.image_to_string(Image.open(filename))))) 
        text = text.replace('-\n', '')     
        f.write(text) 
      
    f.close()
    

    That should solve the problem, because each pass of the loop now opens its own page's image file.
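Since the stated goal is to find which pages contain certain keywords, another option is to keep the page number and the text together in one loop, so no variable is carried over from an earlier loop. The following is only a sketch: ocr_pages and pages_with_keyword are hypothetical helper names, and the OCR step is passed in as a function so the example runs with stand-in data.

```python
def ocr_pages(pages, ocr):
    """Return a list of (page_number, text) pairs, one per page.

    pages is any iterable of page images; ocr is any callable that turns
    one page into a string (for example pytesseract.image_to_string).
    """
    results = []
    for number, page in enumerate(pages, start=1):
        # Rejoin words that were hyphenated at a line break, as in the question.
        text = ocr(page).replace("-\n", "")
        results.append((number, text))
    return results


def pages_with_keyword(results, keyword):
    """Return the page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in results if needle in text.lower()]


# Stand-in data: each "page" is already a string, so ocr just returns it.
fake_pages = ["invoice total", "nothing here", "Total due: 5"]
results = ocr_pages(fake_pages, ocr=lambda page: page)
print(pages_with_keyword(results, "total"))  # [1, 3]
```

With the real libraries, this would presumably be called as ocr_pages(convert_from_path(PDF_file, 500), pytesseract.image_to_string), since convert_from_path returns PIL images and image_to_string accepts them directly. That also skips saving the intermediate JPEG files.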