Search code examples
python-3.xcsvpdfadobefdf

How to parse CSV or FDF data to a Python dictionary and inject into a template PDF form?


I have client data in a CRM that I need to use in automation of PDF form completion.

I have figured out CRM extraction to CSV, post extraction CSV manipulation, correct CSV parsing into FDF format and creation and saving of the new PDF file that should be populated with the data from the FDF. The code is supposed to open up a template form with field names that match the CSV and FDF files, inject the values for the fields and save it as a new file.

The problem I seem to be stuck on is that I cannot get the FDF data into the template. When I open the new saved PDF the feilds are blank. I've played with a few different approaches but I'm so far down the rabbit hole now that I think I've lost the ability to be objective.

I'm using pypdf 4.1.0 library as I saw that the dev had rolled PyPDF2 back into it.

Here is the code that I am working on currently. I've tried many different approaches but this is the cleanest version that represents the current status:

import csv
import os
from pypdf import PdfWriter, PdfReader

# Read the CSV file
with open('acrobat_import.csv', 'r') as file:
    reader = csv.DictReader(file)
    data = list(reader)

# Function to create the FDF file content
def create_fdf_content(row):
    fdf_content = """
    
%FDF-1.2
%����
1 0 obj
<</FDF<</F(template_form.pdf)/Fields[
    <</T(courseType)/V/{courseType}>>
    <</T(gender)/V/{gender}>>
    <</T(dateofbirth)/V({dateofbirth})>>
    <</T(preferredLanguage)/V/{preferredLanguage}>>
    <</T(LastName)/V({lastName})>>
    <</T(firstName)/V({firstName})>>
    <</T(middleName)/V({middleName})>>
    <</T(addressStreet)/V({addressStreet})>>
    <</T(addressCity)/V({addressCity})>>
    <</T(addressState)/V({addressState})>>
    <</T(addressCountry)/V({addressCountry})>>
    <</T(addressPostalCode)/V({addressPostalCode})>>
    <</T(phoneNumber)/V({phoneNumber})>>
    <</T(emailAddress)/V({emailAddress})>>
    <</T(studentSigDate)/V({studentSigDate})>>
    <</T(courseStartDate)/V({courseStartDate})>>
    <</T(courseEndDate)/V({courseEndDate})>>
    <</T(classHours)/V({classHours})>>
    <</T(numberOfStudents)/V({numberOfStudents})>>
    <</T(courseState)/V({courseState})>>
    <</T(courseCity)/V({courseCity})>>
    <</T(courseLanguage)/V/{courseLanguage}>>
    <</T(instructorNumber)/V({instructorNumber})>>
    <</T(instructorLastName)/V({instructorLastName})>>
    <</T(instructorFirstName)/V({instructorFirstName})>>
    <</T(instructorSigDate)/V({instructorSigDate})>>
    <</T(courseFee)/V({courseFee})>>
    <</T(examinerNumber)/V({examinerNumber})>>
    <</T(examinerLastName)/V({examinerLastName})>>
    <</T(examinerFirstName)/V({examinerFirstName})>>
    <</T(examinerSigDate)/V({examinerSigDate})>>]
    >> >>
endobj
trailer
<</Root 1 0 R>>
%%EOF
""".format(**row)
    return fdf_content    
    
# Loop through each row, create an FDF file
for row in data:
    if row['lastName'] and row['firstName']:
        # Create the FDF file
        fdf_filename = f"{row['lastName']}_{row['firstName']} - {row['courseType']}.fdf"
        with open(fdf_filename, 'w', encoding='utf-8') as file:
            file.write(create_fdf_content(row))
        print(f"Created FDF file: {fdf_filename}")

# Open the template PDF
template_pdf = PdfReader("template_form.pdf", "rb")

# Create a dictionary from the fdf file
def parse_fdf_file(fdf_filename):
    fields = []
    with open(fdf_filename, 'r', encoding='utf-8') as fdf_file:
        lines = fdf_file.readlines()
        for line in lines:
            if line.startswith('<</T(') and '/V(' in line: # Does not account for radio button fields which have /V/ not /V()
                field_name = line.split('<</T(')[1].split(')')[0]
                field_value = line.split('/V(')[1].split(')')[0]
                fields.append({'field_name': field_name, 'field_value': field_value})
    return fields

# Import the FDF data into the template PDF
pdf_writer = PdfWriter()
page = template_pdf.pages[0]
fields = template_pdf.get_fields()
pdf_writer.add_page(page)
for field in fields:
    pdf_writer.update_page_form_field_values(0, field['field_name'], field['field_value'])

# Save the resulting PDF with the same name as the FDF file
pdf_filename = f"{row['lastName']}_{row['firstName']} - {row['courseType']}.pdf"
with open(pdf_filename, "wb") as pdf_file:
    pdf_writer.write(pdf_file)
print(f"Created PDF file: {pdf_filename}")

Validation of what I have here is that this code does create an FDF file in the correct format and I can manually import that FDF into the template form. That tells me that the FDF is being created correctly and therefore, the data, field names etc are correct and working.

I'm thinking that I may still be on an ineffective track with this though. I don't need and FDF file, I just need the data in the CSV file to go into the template form and be saved. The CSV has multiple rows with 1 clients data on each row. It occured to me while reading the pypdf docs that the FDF step is wasteful and quite possibly problematic and that the FDF form was not the same entity as a python dictionary.

So, I'm here to ask for input and help. What is the most efficient or at least the most effective way to parse the CSV and inject into the template form?

I should add that the template form has mostly text form fields but some radio button groups. As far as I can tell, the FDF only handles these differently with the value for a text field preceded by /V( and a radio button value by /V/ . I haven't yet fixed lines 75:77 as I wanted to get the feedback from here first.

I did read this post and a whole lot of others but I'm not looking to create a whole PDF, I need to utilise a template form with fields that must be populated.

EDIT/UPDATE: Switched to parsing the CSV directly but getting 'Error: key must be PdfObject':

import csv
from pypdf import PdfReader, PdfWriter

# Define CSV and PDF template file names
CSV_FILE = "acrobat_import.csv"
PDF_TEMPLATE = "pdf_import_template.pdf"

def process_csv_pdf(csv_file, pdf_template):
    try:
        # 1. Read the CSV data into a dictionary
        with open(csv_file, 'r', newline='') as file:
            reader = csv.DictReader(file)
            csv_data = list(reader)           # Read the data as dictionaries

            print("CSV Data:", csv_data)

    except FileNotFoundError as e:
        print(f"Error: CSV file '{csv_file}' not found.")
        return
    except Exception as e:
        print(f"Error reading CSV file: {e}")
        return

    # 2. Process each row in the CSV
    for row in csv_data:
        try:
            # 3. Prepare variables for file naming
            last_name = row.get('lastName', '')
            first_name = row.get('firstName', '')
            course_type = row.get('courseType', '')

            if not all([last_name, first_name, course_type]):
                print(f"Warning: Skipping row due to missing data: {row}")
                continue  # Skip rows with missing fields

            # 4. Create output file name
            output_filename = f"{last_name}_{first_name} - {course_type}.pdf"
            print(f"Processing: {output_filename}")

            # 5. Fill the PDF template
            reader = PdfReader(pdf_template)
            writer = PdfWriter()
            page = reader.pages[0]
            
             # Get form fields 
            fields = reader.get_fields(0) 
            if fields:
                field_map = {field.get("/T"): field for field in fields.values()} # Create a mapping

                for field_name in row:  # Iterate over your CSV keys directly
                    if field_name in field_map:
                        pdf_field = field_map[field_name]
                         # Debugging: 
                        print(type(pdf_field))  # Print the type of the PdfObject
                        print(pdf_field)        # Print the representation of the field object

                        pdf_field["/V"] = row[field_name]  # Direct update attempt
                        writer.update_page_form_field_values(writer.add_page(page), pdf_field.get("/T"), row[field_name]) # Note: Using .get("/T") for the key



            # 6. Save the output PDF
            with open(output_filename, 'wb') as output_file:
                writer.write(output_file)

            print(f"PDF created: {output_filename}")

        except Exception as e:
            print(f"Error processing row: {row}, Error: {e}")

if __name__ == "__main__":
    process_csv_pdf(CSV_FILE, PDF_TEMPLATE) 

Solution

  • This could be a rather straightforward operation to open the CSV, iterate each row, and fill in a new PDF with PyPDF... but it will probably have some kind of problem... 'cause PDFs.

    From the latest PyPDF documentation on filling-out a form, I made a script which will read an input CSV, that looks like:

    | First_name | Last_name | Home_address   | County     | Birthday  |
    |------------|-----------|----------------|------------|-----------|
    | Foo        | Bar       | 123 Main St    | Multnomah  | 1/1/2000  |
    | Baz        | Baker     | 456 Second Ave | Washington | 6/21/1999 |
    

    then iterate each row and call a function to open the original PDF, copy it (based on First_name), and fill it in. I used the csv module's DictReader because I can just pass the dict row directly to the update_page_form_field_values method:

    import csv
    from pypdf import PdfReader, PdfWriter
    
    
    def new_pdf_from(row: dict[str, str]):
        reader = PdfReader("form.pdf")
        writer = PdfWriter()
        writer.append(reader)
    
        writer.update_page_form_field_values(
            writer.pages[0],
            row,
            auto_regenerate=False,
        )
    
        with open(f"output-{row['First_name']}.pdf", "wb") as output_stream:
            writer.write(output_stream)
    
    
    with open("data.csv", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            new_pdf_from(row)
    

    PDFs filled out