Tags: python, python-3.x, postgresql, plpython

Python3 PyPDF2 - how to treat file handlers as BytesIO objects?


I have a nice, tested piece of Python PyPDF2 code (a .py script) designed to operate on 'real' OS files. Having debugged it all, I am now trying to incorporate it into a PL/Python function, replacing the files with io.BytesIO(), or whatever mechanism would be the best candidate for a seamless drop-in.

The file reads and writes will now go to PostgreSQL bytea columns. The input documents have been loaded with PG copy functions, and the byte counts match the sizes on disk; so far so good.

Original code expected files:

# infile = "myInputPdf.pdf"
# outfile = "myOutputPdf.pdf"

# inputStream  = open(infile, "rb")  # designed to open OS-based file
# --- Instead: 'document_in' loaded from PG bytea col:
inputStream = io.BytesIO(document_in)
# ---
pdf_reader = PdfFileReader(inputStream, strict=False)
# (lots of code in here, seems? to be working)
outputStream = io.BytesIO()   # trying it the python3 way!
pdf_writer.write(outputStream)

(I've assumed the objects should be treated as byte objects)

Finally, preparing the update fails:

plan3 = plpy.prepare("UPDATE documents SET document_out=$2 WHERE name=$1", ["varchar"]["varchar"])
ERROR:  TypeError: list indices must be integers, not str

(PostgreSQL 11.1, if it matters)

I have done similar things in the past using mkstemp techniques; now I am trying to grow up into the bytes world!


Solution

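  • The second argument to plpy.prepare() must be a single list of parameter types, one per placeholder. Writing ["varchar"]["varchar"] builds a one-element list and then indexes it with a string, which is what raises the TypeError. The document_out column is bytea, not varchar, so the type for $2 must be "bytea". Finally, pass bytes (not a file object) when updating the column: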

    plan3 = plpy.prepare("UPDATE documents SET document_out=$2 WHERE name=$1", ["varchar", "bytea"])
    outputStream.seek(0)
    bytes_out = outputStream.read()
    plpy.execute(plan3, ['some name', bytes_out])
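The in-memory buffer handling can be tried outside the database. The following is a minimal sketch, assuming the bytea value arrives as a Python bytes object (as it does under Python 3 PL/Python); `document_in` here is made-up sample data, and a plain `write()` stands in for `pdf_writer.write()`. It shows why the rewind is needed before `read()`, and that `getvalue()` is an alternative that ignores the current position.

```python
import io

# Hypothetical stand-in for the bytes loaded from the bytea column.
document_in = b"%PDF-1.4\n% sample bytes only, not a real PDF\n"

# Reading: wrap the bytes so file-oriented code can read() and seek() on them.
input_stream = io.BytesIO(document_in)
header = input_stream.read(8)

# Writing: file-oriented code writes into an initially empty buffer.
# (A plain write() stands in for pdf_writer.write(output_stream).)
output_stream = io.BytesIO()
output_stream.write(document_in)

# The position is now at the end, so read() on its own returns nothing.
at_end = output_stream.read()

# Option 1: rewind, then read the whole buffer.
output_stream.seek(0)
via_read = output_stream.read()

# Option 2: getvalue() returns the full contents regardless of position.
via_getvalue = output_stream.getvalue()
```

Either `via_read` or `via_getvalue` is a bytes object that could be passed as the bytea parameter; `getvalue()` saves the explicit `seek(0)`.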