python pandas csv pdfplumber

Python & Pandas: combining multiple rows into single cell


I'm writing a script that extracts text from a PDF file and inserts it as a string into a single CSV row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the CSV as an individual row. However, I'm struggling to figure out how to combine those rows into a single cell. I'm attempting to use pandas' pd.concat function to combine them, but so far without success.

Here's my code:

import pdfplumber
import pandas as pd
import csv

file1 = open("pdf_texts.csv", "w", newline="")
file2 = open("pdf_text_pgs.csv", "w", newline="")
writer2 = csv.writer(file2)
headers = ['text']

with pdfplumber.open('target.pdf') as pdf:
    pdf_length = len(pdf.pages)

    writer2.writerow(headers)

    for page_number in range(0, pdf_length):
        pdf_output = pdf.pages[page_number]
        pdf_txt = pdf_output.extract_text().encode('UTF-8')
        writer2.writerow([pdf_txt])

    # this is my attempt for pd.concat
    df  = pd.read_csv("pdf_text_pgs.csv", 'r')
    df_txts = df['text']
    pdf_txt_df = pd.concat([df_txts], axis=0, ignore_index=True)
    pdf_txt_df.to_csv('pdf_texts.csv', header=False, index=False)

However, the final output fails to combine the rows and, worse yet, seems to lose the final row. Any suggestions on how to approach this? All help greatly appreciated.


Solution

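  • pd.concat joins Series or DataFrames end to end; it does not merge the strings inside them, so passing it a single column returns that column unchanged. The missing final row is most likely because pdf_text_pgs.csv is read back before it has been closed, so the buffered writes have not all reached disk yet. It is simpler to skip the intermediate file: store the text from each page in a list and join it all at the end. For example: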

    import pdfplumber
    import csv
    
    with pdfplumber.open('target.pdf') as pdf, \
         open("pdf_text_pgs.csv", "w", newline="", encoding="utf-8") as f_output:
    
        csv_output = csv.writer(f_output)
        csv_output.writerow(['text'])
    
        text = []
        
        for page in pdf.pages:
            extracted_text = page.extract_text()
            
            if extracted_text:  # skip pages with no extractable text (blank or image-only)
                text.append(extracted_text)
            
        csv_output.writerow([' '.join(text)])
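
If you would rather do the combining in pandas, as the question attempts, the following is a minimal sketch of one way to collapse a column of per-page strings into a single cell. The sample DataFrame and the combine_rows helper are hypothetical stand-ins (in practice the column would come from reading pdf_text_pgs.csv after that file has been closed), and the space separator is an assumption carried over from the answer above.

```python
import pandas as pd


def combine_rows(df, column="text", sep=" "):
    """Join every non-missing value in `column` into one string.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # dropna() discards pages with no text; str.cat() with only `sep`
    # concatenates the whole Series into a single Python string.
    combined = df[column].dropna().astype(str).str.cat(sep=sep)
    # Wrap the result in a one-row, one-column DataFrame.
    return pd.DataFrame({column: [combined]})


# Made-up per-page text standing in for the rows of pdf_text_pgs.csv.
pages = pd.DataFrame(
    {"text": ["Page one text.", "Page two text.", None, "Page four text."]}
)

result = combine_rows(pages)

# With no path given, to_csv returns the CSV as a string;
# pass a filename such as "pdf_texts.csv" to write it to disk instead.
csv_text = result.to_csv(index=False)
print(csv_text)
```

For this particular task the plain list and ' '.join approach above is probably still the simpler choice, since it avoids the round trip through an intermediate CSV file.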