Tags: python-3.x, csv, pdf, pypdf

When using PyPDF2 for Python, how do I transfer data in CSV format to an existing PDF with blank form fields?


I am using the PyPDF2 package with Python. My data was originally collected with a Google Form and then downloaded as a CSV file. I am hoping to copy this data into an existing PDF whose fields are similar to those of the original Google Form, though not an exact match. The PyPDF2 website offers some examples (https://pypdf2.readthedocs.io/en/3.0.0/user/forms.html), but they seem to create an entirely new PDF to move the original data into rather than filling in an existing PDF. Am I misreading their code?

This is the code I have so far. I know that the first few lines, up until "reads an existing PDF file...", work and print the CSV contents as a list, as intended. After that, I have just pasted in the code from PyPDF2's website and added some more descriptive comments to try to make sense of it. Does it make sense to use PyPDF2 to find the form fields in the existing PDF and use a for loop to iterate through the CSV file to insert matching information?

import csv
from PyPDF2 import PdfReader, PdfWriter

# opens csv file and returns a file object - type of file is “_io.TextIOWrapper”
file = open("AfF.csv")
csvreader = csv.reader(file)

# reads the first row of the CSV (the header) into the list "header"
header = next(csvreader)
print(header)

# iterates through csvreader and appends each remaining row to the rows list
rows = []
for row in csvreader:
    rows.append(row)
print(rows)

# reads an existing PDF file "form.pdf" that contains fillable form fields
reader = PdfReader("form.pdf")
fields = reader.get_form_text_fields() # extracts text fields from the PDF form, stores extracted form fields in the "fields" variable
fields == {"key": "value", "key2": "value2"} # illustrative only (this comparison has no effect): fields is a dictionary mapping field names (keys) to their current values (values) in the PDF form


# fills out form fields in PDF "form.pdf" and saves the filled PDF as "filled-out.pdf"
reader = PdfReader("form.pdf") # instantiates "PdfReader" object "reader" for reading the existing PDF file
writer = PdfWriter() # instantiates "PdfWriter" object "writer" for creating a new PDF

page = reader.pages[0] # retrieves the first page of pdf
fields = reader.get_fields() # gets all form fields from the PDF

writer.add_page(page) # add the retrieved page (page) to the PdfWriter object (writer) using writer.add_page(page)

writer.update_page_form_field_values( # updates the form field values on the first page (writer.pages[0]) with a dictionary specifying field names and their new values {"fieldname": "some filled in text"}
    writer.pages[0], {"fieldname": "some filled in text"} 
)

# write the filled-out form to "filled-out.pdf"
with open("filled-out.pdf", "wb") as output_stream: # write the modified pdf to "filled-out.pdf" by opening a binary file "wb" and using writer.write(output_stream)
    writer.write(output_stream)

Solution

  • The example opens one PDF with a reader, then copies it over to a writer. This step is mandatory: PyPDF2 cannot open a PDF for in-place "editing". The example then saves the result to another file, creating a copy on disk. I'd say the example follows the pattern of a blank PDF with fields serving as a template, on the expectation that you want to create filled-out copies from dynamic data. I imagine that, from your Google Forms data, you'd want one PDF per row of submitted form values. If so, read on.

    Also, if you don't need PyPDF2 specifically, consider pypdf: the development effort that was going into PyPDF2 has moved to pypdf. Quoting the pypdf project history, "pypdf: Back to the Roots (2023-Today)":

    In order to make things simpler for beginners, PyPDF2 was merged back into pypdf. Now all lowercase, without a number. We hope that the folks who develop PyPDF3 and PyPDF4 also join us.

    With that out of the way... I'd start out as simple as possible and work my way up.

    I created this simple PDF (that you can download to follow along with) with just two fields, Name and Favorite color:

    (Image: empty form)

    I'll use pypdf to get the field names:

    from pypdf import PdfReader
    
    reader = PdfReader("form.pdf")
    fields = reader.get_fields()  # dict of field name -> field properties
    
    print(fields)
    

    and I get:

    {
        'Name': {'/T': 'Name', '/FT': '/Tx'},
        'Fav_color': {'/T': 'Fav_color', '/FT': '/Tx'}
    }
    

    The {'/T' ...} values don't matter here; only the key names, Name and Fav_color, are needed.
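
    Since only the keys matter, the dictionary can be reduced to a plain list of names. This is a minimal sketch, assuming `fields` has the shape printed above, where each value carries an `/FT` entry and `/Tx` marks a text field. `text_field_names` is a hypothetical helper, not part of pypdf, and the sample data is copied from the output above so the sketch runs without a PDF:

```python
def text_field_names(fields):
    """Return the names of text fields (/FT == /Tx) from a get_fields()-style dict."""
    if not fields:  # assumption: an empty dict or None means the PDF has no form fields
        return []
    return [name for name, info in fields.items() if info.get("/FT") == "/Tx"]


# Sample data copied from the printed output above (not read from a real PDF).
sample = {
    "Name": {"/T": "Name", "/FT": "/Tx"},
    "Fav_color": {"/T": "Fav_color", "/FT": "/Tx"},
}

print(text_field_names(sample))  # ['Name', 'Fav_color']
```

    With a real form, the same helper could be given the result of `reader.get_fields()` instead of `sample`.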

    Then use that reader and try to update Name and Fav_color:

    from pypdf import PdfWriter
    
    writer = PdfWriter()
    writer.append(reader)
    
    fields = {"Name": "Alice", "Fav_color": "blue"}
    
    writer.update_page_form_field_values(
        writer.pages[0],
        fields,
        auto_regenerate=False,
    )
    
    with open("filled-out.pdf", "wb") as output_stream:
        writer.write(output_stream)
    

    I open filled-out.pdf, and it looks like:

    (Image: filled-out PDF)

    So that worked! I'd then bundle that up in a function that lets me specify the new file name and the field values to use:

    def fill_out_pdf(new_name: str, fields: dict[str, str]) -> None:
        """Fill the template form.pdf with `fields` and save it as `new_name`."""
        reader = PdfReader("form.pdf")
    
        writer = PdfWriter()
        writer.append(reader)  # copy the whole template into the writer
    
        # Only the fields on the first page are updated here.
        writer.update_page_form_field_values(
            writer.pages[0],
            fields,
            auto_regenerate=False,
        )
    
        with open(new_name, "wb") as output_stream:
            writer.write(output_stream)
    
    
    fill_out_pdf("filled-out.pdf", {"Name": "Alice", "Fav_color": "blue"})
    

    and that looks the same as above.
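
    Before calling a function like this with arbitrary data, it may help to check that every key really is a field name in the template, so a misspelled name is caught early. A rough sketch, assuming the template's field names are already known as a list; `unknown_keys` is a hypothetical helper, not a pypdf function:

```python
def unknown_keys(values, template_fields):
    """Return the keys of `values` that are not field names in the template, sorted."""
    return sorted(set(values) - set(template_fields))


# Field names of the two-field example form used in this answer.
template_fields = ["Name", "Fav_color"]

print(unknown_keys({"Name": "Alice", "Fav_color": "blue"}, template_fields))
# []
print(unknown_keys({"Name": "Alice", "Favorite color": "blue"}, template_fields))
# ['Favorite color']
```

    A non-empty result suggests the data uses a label (such as a Google Forms question title) where the PDF expects its own internal field name.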

    From there, I could move on to try and integrate dynamic data from a CSV:

    Name,Favorite color
    Alice,blue
    Bobbie,blue
    Charlie,vermilion
    

    To keep it simple in this example, I'll use a csv.reader and map the CSV field positions in a row (0-based) to PDF field names: 0 → Name, 1 → Fav_color.

    import csv
    
    with open("input.csv", newline="") as f:
        csv_reader = csv.reader(f)  # named csv_reader to avoid confusion with the PdfReader
        next(csv_reader)  # skip header
        rows = list(csv_reader)
    
    for row in rows:
        name = row[0]
        fav_color = row[1]
    
        new_name = f"{name}.pdf"
        fields = {"Name": name, "Fav_color": fav_color}
    
        fill_out_pdf(new_name, fields)
    

    When I run that, I get three PDFs, which look like:

    (Image: all PDFs filled out)
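
    The positional mapping in the loop above breaks if the CSV columns are ever reordered. One alternative is to key the mapping on the header text with csv.DictReader. This is only a sketch: `CSV_TO_PDF` and `rows_to_field_dicts` are hypothetical names, and the sample CSV is inlined so it runs without a file on disk:

```python
import csv
import io

# Hypothetical mapping: CSV header text -> PDF field name.
CSV_TO_PDF = {"Name": "Name", "Favorite color": "Fav_color"}


def rows_to_field_dicts(csv_file, mapping):
    """Yield one {pdf_field: value} dict per CSV row, looked up by header name."""
    for row in csv.DictReader(csv_file):
        yield {pdf_field: row[column] for column, pdf_field in mapping.items()}


# The sample CSV from this answer, inlined instead of read from input.csv.
sample_csv = io.StringIO(
    "Name,Favorite color\n"
    "Alice,blue\n"
    "Bobbie,blue\n"
    "Charlie,vermilion\n"
)

for fields in rows_to_field_dicts(sample_csv, CSV_TO_PDF):
    print(fields)
# {'Name': 'Alice', 'Fav_color': 'blue'}
# {'Name': 'Bobbie', 'Fav_color': 'blue'}
# {'Name': 'Charlie', 'Fav_color': 'vermilion'}
```

    Each yielded dict has the shape that fill_out_pdf expects for its fields argument. A header missing from the CSV raises a KeyError, which is arguably preferable to silently filling the wrong column.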

    Keep in mind that this is a very simple example: just a single PDF page, and no issues with the PDF itself.
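
    Another simplification is the output file name: the loop builds it directly from the submitted name, so two rows with the same name would overwrite each other, and characters such as "/" could produce an invalid path. A possible sketch of a safer naming step follows; `safe_pdf_name` is a hypothetical helper, and the replacement rules are an assumption rather than a standard:

```python
import re


def safe_pdf_name(name, used):
    """Build a .pdf file name from `name`, avoiding odd characters and duplicates.

    `used` is a set of names already handed out; it is updated in place.
    """
    # Keep word characters, dots and hyphens; replace any other run with "_".
    stem = re.sub(r"[^\w.-]+", "_", name).strip("._") or "unnamed"
    candidate = f"{stem}.pdf"
    counter = 2
    while candidate in used:  # add a numeric suffix until the name is unique
        candidate = f"{stem}-{counter}.pdf"
        counter += 1
    used.add(candidate)
    return candidate


used = set()
for name in ["Alice", "Bobbie", "Alice", "Charlie / Chaz"]:
    print(safe_pdf_name(name, used))
# Alice.pdf
# Bobbie.pdf
# Alice-2.pdf
# Charlie_Chaz.pdf
```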

    This kind of work can get tricky fast, as problems with the PDF itself can make any field come out looking wrong. I worked on a project where one field out of 300+ didn't render correctly in the saved, filled-out version: clearly not an issue with the Python program, just something deep down in the PDF. So, be aware, and good luck!