I'm writing a script that extracts text from a PDF file and inserts it as a string into a single CSV row. Using pdfplumber, I can successfully extract the text, with each page's text inserted into the CSV as an individual row. However, I'm struggling to figure out how to combine those rows into a single cell. I'm attempting to use the pandas pd.concat function to combine them, but so far without success.
Here's my code:
import pdfplumber
import pandas as pd
import csv

file1 = open("pdf_texts.csv", "w", newline="")
file2 = open("pdf_text_pgs.csv", "w", newline="")
writer2 = csv.writer(file2)
headers = ['text']

with pdfplumber.open('target.pdf') as pdf:
    pdf_length = len(pdf.pages)
    for page_number in range(0, pdf_length):
        pdf_output = pdf.pages[page_number]
        pdf_txt = pdf_output.extract_text().encode('UTF-8')

# this is my attempt for pd.concat
df = pd.read_csv("pdf_text_pgs.csv", 'r')
df_txts = df['text']
pdf_txt_df = pd.concat([df_txts], axis=0, ignore_index=True)
pdf_txt_df.to_csv('pdf_texts.csv', header=False, index=False)
However, the final output fails to combine the rows, and worse yet seems to lose the final row. Any suggestions on how to approach this? All help gratefully appreciated.
You would just need to store the text from each page in a list and combine it all at the end. For example:
import pdfplumber
import csv

with pdfplumber.open('target.pdf') as pdf, \
        open("pdf_text_pgs.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    text = []

    for page in pdf.pages:
        extracted_text = page.extract_text()

        if extracted_text:  # skip empty pages or pages with images
            text.append(extracted_text)

    # write all pages as a single cell in a single row
    csv_output.writerow([' '.join(text)])
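
If you would rather stay in pandas, note that pd.concat stacks Series or DataFrames on top of each other; given a single Series, it returns the same rows unchanged, so it cannot merge them into one cell. Joining the strings is what is needed instead. Here is a minimal sketch, assuming the per-page text is already in a `text` column; the sample DataFrame is made up for illustration and stands in for whatever you read from your per-page CSV:

```python
import pandas as pd

# Hypothetical stand-in for the per-page rows (one row per page);
# None represents a page where no text could be extracted.
pages = pd.DataFrame({"text": ["First page.", "Second page.", None, "Last page."]})

# pd.concat with a single Series just returns the same rows again.
same_rows = pd.concat([pages["text"]], axis=0, ignore_index=True)

# Drop missing pages, then join the remaining strings into one string.
combined = pages["text"].dropna().str.cat(sep=" ")

# Wrap the combined string in a one-row DataFrame and serialise it as CSV.
single_cell = pd.DataFrame({"text": [combined]})
csv_text = single_cell.to_csv(index=False)
print(csv_text)
```

Passing a file name to to_csv (for example 'pdf_texts.csv') writes the result to disk instead of returning it as a string.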