Tags: python, dataframe, pdf, apache-tika, pdf-conversion

How to extract text from PDFs in folders with Python and save it in a DataFrame?


I have many folders, each containing a few PDF files (along with other file types such as .xlsx or .doc). My goal is to extract the text of the PDFs in each folder and build a data frame in which each row corresponds to a folder (identified by "Folder Name") and each column holds the text content of one PDF file in that folder, as a string.

I managed to extract the text from a single PDF file with the tika package (code below), but I cannot work out how to loop over the other PDFs in the folder, or over the other folders, to build a structured data frame.

# import the parser module from tika
from tika import parser

# parse the PDF file; the result is a dict
parsed_pdf = parser.from_file("ducument_1.pdf")

# the extracted text is stored under the 'content' key as a string
# (document metadata is available under parsed_pdf['metadata'])
data = parsed_pdf['content']

# print the content
print(data)

# <class 'str'>
print(type(data))

The desired output should look like this:

Folder_Name  pdf1              pdf2
17534        text of the pdf1  text of the pdf2
63546        text of the pdf1  text of the pdf2
26374        text of the pdf1  -

Solution

  • To find all the PDFs in a directory and its subdirectories, you can combine os.walk and glob; see "Recursive sub folder search and return files in a list python". I've used a slightly longer form so that beginners can follow what is happening.

    Then, for each file, call Apache Tika and save the result as the next row of the Pandas DataFrame.

    #!/usr/bin/python3

    import glob
    import os

    from pandas import DataFrame
    from tika import parser

    # What file extension to find, and where to look from
    ext = "*.pdf"
    PATH = "."

    # Find all the files with that extension, in PATH and every subfolder
    files = []
    for dirpath, dirnames, filenames in os.walk(PATH):
        files += glob.glob(os.path.join(dirpath, ext))

    # Parse each file with Tika and collect one (filename, text) row per PDF
    rows = []
    for filename in files:
        data = parser.from_file(filename)
        # "content" can be None when Tika finds no text (e.g. a scanned PDF)
        rows.append((filename, data["content"] or ""))

    # Create a Pandas DataFrame holding the filenames and the text
    df = DataFrame(rows, columns=["filename", "text"])

    # For debugging, print what we found
    print(df)
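
The script above yields one row per PDF file, whereas the question asks for one row per folder with one column per PDF. A minimal sketch of that reshaping follows. The helper names (tika_extract, collect_pdf_text, to_wide_frame), the pdf1, pdf2, ... column labels and the "-" placeholder are assumptions made for illustration, not part of Tika or pandas. Folders are keyed by their path relative to the starting directory, and the text extractor is passed in as a parameter so that the folder-walking logic can be tried without a running Tika server.

```python
import os
from collections import defaultdict


def tika_extract(path):
    """Return the text Tika extracts from one file, or "" if it finds none."""
    # Imported lazily so the folder-walking logic works without Tika installed.
    from tika import parser
    return (parser.from_file(path).get("content") or "").strip()


def collect_pdf_text(root, extract=tika_extract):
    """Map each folder (path relative to root) to the texts of its PDFs."""
    texts_by_folder = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a deterministic order
        for name in sorted(filenames):
            # Skip .xlsx, .doc and any other non-PDF files
            if name.lower().endswith(".pdf"):
                folder = os.path.relpath(dirpath, root)
                text = extract(os.path.join(dirpath, name))
                texts_by_folder[folder].append(text)
    return dict(texts_by_folder)


def to_wide_frame(texts_by_folder, missing="-"):
    """Build one row per folder with columns pdf1, pdf2, ..."""
    import pandas as pd
    width = max((len(t) for t in texts_by_folder.values()), default=0)
    columns = [f"pdf{i + 1}" for i in range(width)]
    # Pad shorter folders so every row has the same number of columns
    rows = {folder: texts + [missing] * (width - len(texts))
            for folder, texts in texts_by_folder.items()}
    df = pd.DataFrame.from_dict(rows, orient="index", columns=columns)
    df.index.name = "Folder_Name"
    return df
```

With Tika and pandas installed, something like df = to_wide_frame(collect_pdf_text(".")) should give a frame shaped like the desired output. PDFs sitting directly in the starting directory end up in a row labelled ".".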