python, python-3.x, pdfminer

Problem reading a PDF to XML in memory using pdfminer.six


Consider the following snippet:

import io
from pdfminer.high_level import extract_text_to_fp

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')

data = result.getvalue()

This results in the following error:

ValueError: Codec is required for a binary I/O output

If I leave out output_type, I get the following error instead:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>

I don't understand why this happens, and would like help with a workaround.


Solution

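  • I figured out how to fix the problem. First, open "file.pdf" in binary mode ('rb'). A PDF is binary data, and in text mode Python tries to decode it with the platform's default encoding, which is what raises the UnicodeDecodeError. Then, to read the result into memory, use BytesIO instead of StringIO and decode the bytes afterwards. For example: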

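    import io
    from pdfminer.high_level import extract_text_to_fp

    result = io.BytesIO()
    # Open the PDF in binary mode so Python does not try to decode it as text
    with open("file.pdf", 'rb') as fp:
        extract_text_to_fp(fp, result, output_type='xml')

    # The buffer holds encoded bytes; decode them to get the XML as a str
    data = result.getvalue().decode("utf-8")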
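
To see why both changes matter, here is a minimal sketch that uses only the standard library and does not call pdfminer at all. The byte strings are made-up stand-ins: `pdf_bytes` imitates a few raw bytes of a PDF (including the 0x8f from the traceback), and `xml_bytes` imitates encoded converter output. It assumes the text-mode default encoding was cp1252, which is typical on Windows and matches the 'charmap' message, and that the XML converter writes encoded bytes to the output stream.

```python
import io

# Hypothetical stand-in for raw PDF content; 0x8f is the byte named in the traceback.
pdf_bytes = b"%PDF-1.4\n\x8f\n"

# Text mode decodes the file with the default encoding. cp1252 has no
# mapping for 0x8f, so decoding fails just like in the question.
try:
    pdf_bytes.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

# Hypothetical stand-in for the encoded XML a converter would write.
xml_bytes = "<pages></pages>".encode("utf-8")

# A binary buffer accepts bytes; decode once at the end to get a str.
binary_buffer = io.BytesIO()
binary_buffer.write(xml_bytes)
data = binary_buffer.getvalue().decode("utf-8")

# A text buffer rejects the same bytes with a TypeError.
text_buffer = io.StringIO()
try:
    text_buffer.write(xml_bytes)
    text_write_failed = False
except TypeError:
    text_write_failed = True
```

Reading in 'rb' mode skips the decoding step entirely, and BytesIO gives the converter a stream that accepts the bytes it produces.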