python, pdf, machine-learning, nlp, pdf-scraping

How to return all extracted text from multiple PDFs in python?


This is my code. So far, it prints all of the PDFs' content via the pages variable, but I cannot get the function to return the same extracted text. I've been testing it by placing random PDFs in the folder I'm walking. How do I get it to return the extracted text the same way it prints it?

import os
import PyPDF2
import pandas as pd

def scan_files(root):
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                for p in range(0, numPages):
                    pages = ''
                    page = pdf.getPage(p)
                    pages += page.extractText()
                    pages = pages.replace('\n', '')
                    #print(pages)
                    return pages

Solution

  • Printing the text lets the loops keep iterating (that is what the "print(pages)" you mentioned does). Returning pages, however, exits the function as soon as it is reached and terminates the loops, so it spits out only the text covered up to that point. Instead, accumulate the text and return it once, after the loops finish. Try using something like:

    def scan_files(root):
        pdftext = ''
        for path, subdirs, files in os.walk(root):
            for name in files:
                if name.endswith('.pdf'):
                    #print(name)
                    pdf = PyPDF2.PdfFileReader(os.path.join(path, name))
                    numPages = pdf.getNumPages()

                    # gather the text of every page of this PDF
                    pages = ''
                    for p in range(0, numPages):
                        page = pdf.getPage(p)
                        pages += page.extractText()
                    pages = pages.replace('\n', '')

                    # add this PDF's text to the running total
                    pdftext += pages

        # return only after every file has been processed
        return pdftext
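
  • To sanity-check the fix, call the function on the folder you are walking and inspect what comes back. A minimal usage sketch (the folder path here is a placeholder, not taken from your post):

    text = scan_files('/path/to/pdf/folder')  # placeholder root folder
    print(len(text))     # total number of characters extracted across all PDFs
    print(text[:500])    # preview the first 500 characters

  • Note that PdfFileReader, getNumPages(), getPage() and extractText() belong to the old PyPDF2 1.x API. If you are on a newer PyPDF2 (3.x) or its successor pypdf, the equivalent calls look roughly like the sketch below (same logic as above, untested here):

    from pypdf import PdfReader  # or: from PyPDF2 import PdfReader

    def scan_files(root):
        pdftext = ''
        for path, subdirs, files in os.walk(root):
            for name in files:
                if name.endswith('.pdf'):
                    reader = PdfReader(os.path.join(path, name))
                    # join the text of every page, guarding against empty pages
                    pages = ''.join(page.extract_text() or '' for page in reader.pages)
                    pdftext += pages.replace('\n', '')
        return pdftext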