Tags: python, csv, pdf, pdfplumber

Python - inserting header into a csv


I'm developing a script that loops over all pdf files in a directory, extracts the text from each one, and writes it into its own cell of a csv file. I can successfully write the output into the cells. However, I need the csv file to contain the header "text" for merging with another csv. Thus far my attempts to insert that header with csv.writer have not worked.

For example, the code below successfully extracts and inserts the text from pdfs, but writes a new header for every file extracted:

import pdfplumber
import csv
import glob

pdfs = glob.glob("dir\*.pdf")

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:

        csv_output = csv.writer(f_output)
        csv_output.writerow(['text']) # code for inserting header
        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])

The other approach I've attempted is likewise unsuccessful. I tried to first write the header into the csv, and append the output of the loop to the csv. However, for some reason the formatting of the pdf output is completely disrupted, with text scattered across multiple cells instead of a single cell.

pdfs = glob.glob("dir\*.pdf")

# code for writing header
file = open("pdf_output.csv", "w", newline="")
writer = csv.writer(file)
headers = ['text']
writer.writerow(headers)

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
        open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:

        csv_output = csv.writer(f_output)

        text = []

        for page in pdf.pages:
            extracted_text = page.extract_text()

            if extracted_text:  
                text.append(extracted_text)

        csv_output.writerow([' '.join(text)])

Any suggestions on workarounds or better approaches for this challenge would be immensely welcome.


Solution

  • You could open the csv first, insert your header, then iterate through your PDFs:

    import pdfplumber
    import csv
    import glob
    
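    # Raw string so the backslash is kept literally as the Windows path separator
    pdfs = glob.glob(r"dir\*.pdf")
    
    # Write the header once; "w" starts a fresh file so reruns don't add a second header
    with open("pdf_output.csv", "w", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['text'])
        
    for pf in pdfs:
        with pdfplumber.open(pf) as pdf, \
                open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output: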
     
            csv_output = csv.writer(f_output)
            text = []
    
            for page in pdf.pages:
                extracted_text = page.extract_text()
    
                if extracted_text:  
                    text.append(extracted_text)
    
            csv_output.writerow([' '.join(text)])
    

    Or just check if it's the first iteration:

    import pdfplumber
    import csv
    import glob
    
    pdfs = glob.glob(r"dir\*.pdf")
    
    for i, pf in enumerate(pdfs):
        with pdfplumber.open(pf) as pdf, \
                open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        
            csv_output = csv.writer(f_output)
            # Write the header only on the first pass through the loop
            if i == 0:
                csv_output.writerow(['text'])
    
            text = []
    
            for page in pdf.pages:
                extracted_text = page.extract_text()
    
                if extracted_text:  
                    text.append(extracted_text)
    
            csv_output.writerow([' '.join(text)])
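
A third option, offered only as a rough sketch: open the csv a single time in "w" mode, write the header, and keep the same writer for every PDF, so the file is never reopened and a rerun cannot stack up extra headers. The helper names `extract_pdf_text`, `write_text_csv` and `main` are made up for illustration, and pdfplumber is imported inside the helper purely so that the csv part can be tried on its own. The pdfplumber calls are the same ones used in the question.

```python
import csv
import glob


def extract_pdf_text(path):
    """Return the text of every page in one PDF, joined by spaces."""
    import pdfplumber  # imported here so the csv helper below works without it

    with pdfplumber.open(path) as pdf:
        pages = (page.extract_text() for page in pdf.pages)
        # Skip pages where extract_text() returned nothing
        return " ".join(text for text in pages if text)


def write_text_csv(texts, out_path="pdf_output.csv"):
    """Write a 'text' header, then one single-cell row per input string."""
    # "w" truncates any earlier output, so the header appears exactly once
    with open(out_path, "w", newline="", encoding="utf-8") as f_output:
        writer = csv.writer(f_output)
        writer.writerow(["text"])
        for text in texts:
            writer.writerow([text])


def main():
    # Same directory pattern as in the question (raw string keeps the backslash)
    pdfs = sorted(glob.glob(r"dir\*.pdf"))
    write_text_csv(extract_pdf_text(pf) for pf in pdfs)
```

Calling `main()` would produce `pdf_output.csv` with the header on the first row. Since `csv.writer` quotes any field containing commas or line breaks, each PDF's text should come back as a single cell when the file is read with a csv-aware reader such as `csv.reader` (opened with `newline=""`).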