Python Uudecode Call Corruption


I am working on extracting PDFs from SEC filings. They usually come like this:

SEC Filing Example

For some reason, when I save the raw (uuencoded) PDF to a .txt file and then try to run

uudecode -o output_file.pdf input_file.txt

via Python's subprocess.call(), or any other Python function that executes shell commands, the generated PDF files are corrupted. If I run the same command directly from the command line, there is no corruption.

On closer inspection, the PDF file produced by the Python script appears to end prematurely. Is there some sort of output limit when executing a command line command from python?

Thanks!


Solution

  • This script worked fine for me running under Python 3.4.1 on Fedora 21 x86_64 with uudecode 4.15.2:

    import subprocess
    subprocess.call("uudecode -o output_file.pdf input_file.txt", shell=True)
    

    Using the linked SEC filing (length: 173,141 B; sha1: e4f7fa2cbb3422411c2f2968d954d6bb9808b884), the decoded PDF (length: 124,557 B; sha1: 1676320e1d9923e14d19451c16688198bc93ca0d) appears correct when viewed.

    There may be something else in your environment causing the problem. You may want to add additional details to your question.

    "Is there some sort of output limit when executing a command line command from python?"

    If by "output limit" you mean the size of the file being written by uudecode, then no. The only type of "output limit" you need to worry about when using the subprocess module is when you pass stdout=PIPE or stderr=PIPE when creating a child process. If the child process writes enough data to either of these streams, and your script does not regularly drain them, the child process will block (see the subprocess module documentation). In my test, uudecode wrote nothing to stdout or stderr.
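
To illustrate the pipe caveat described above, here is a minimal sketch of draining stdout and stderr with Popen.communicate() so the child cannot block on a full pipe. The run_and_drain helper and the stand-in child command are hypothetical, chosen only so the example runs without uudecode installed; for the question's case the argument list would presumably be ["uudecode", "-o", "output_file.pdf", "input_file.txt"].

```python
import subprocess
import sys

# Hypothetical stand-in child: a Python one-liner that writes 1 MiB to stdout,
# which is more than a typical OS pipe buffer holds.
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]


def run_and_drain(cmd):
    """Run cmd (an argument list, no shell) and read both pipes to completion."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads stdout and stderr until EOF and then waits for the
    # child to exit, so the child never stalls on a full pipe.
    out, err = proc.communicate()
    return proc.returncode, out, err


returncode, out, err = run_and_drain(CHILD)
print(returncode, len(out), len(err))
```

The returned exit status and stderr bytes are also worth inspecting when a command behaves differently under Python than in an interactive shell.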