Sanitizing PDF user input in Python


My app allows users to upload a PDF file. The files should look broadly similar, since they are all variants of the same format. I understand that PDFs can sometimes contain malicious content, e.g. JavaScript that is executed when the file is opened in Adobe Reader or a similar viewer.

I've seen a few packages online, for example PDFiD, which helps you inspect potentially questionable PDFs. It appears to let you see all the underlying content types. My current plan is to work out which content types my documents should contain, then block the upload of any file that contains something unusual.
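A minimal sketch of that blocking idea, using only the standard library. The list `SUSPICIOUS_NAMES` and the helpers `count_names` and `is_upload_acceptable` are hypothetical names chosen for illustration, not part of PDFiD or any other package. The scan is deliberately naive: it only looks at the raw bytes, so it will miss names written with `#xx` hex escapes or hidden inside compressed object streams.

```python
import re

# Assumed blocklist: PDF names that rarely belong in a plain, static document.
SUSPICIOUS_NAMES = (
    b"/JS", b"/JavaScript", b"/AA", b"/OpenAction",
    b"/Launch", b"/EmbeddedFile", b"/RichMedia", b"/XFA",
)


def count_names(data: bytes) -> dict:
    """Count raw occurrences of each suspicious name in the file bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead keeps a longer name that merely starts with the
        # same characters (e.g. /AAPL) from being counted as /AA.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


def is_upload_acceptable(data: bytes) -> bool:
    """Reject anything without a PDF header or with a suspicious name."""
    if not data.startswith(b"%PDF-"):
        return False
    return all(count == 0 for count in count_names(data).values())
```

A rejection here means only that the file contains a name your expected documents should not need. A pass does not prove the file is safe, so this is best treated as a first filter in front of a real parser-based check.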

Is there a simple way to use Python to automatically scrub a PDF of malicious content, removing all executable code that it may contain? I know there is a PDF/A format which allows for something like this, but is there any package like PyPDF2 that has a sanitize function?


Solution

  • I believe this to be the answer:

    from pdfid import PDFiD
    new_file = PDFiD('path/to/file', disarm=True)
    

    It will take the keywords it finds in the PDF

    <Keywords>
        <Keyword Count="56" HexcodeCount="0" Name="obj"/>
        <Keyword Count="56" HexcodeCount="0" Name="endobj"/>
        <Keyword Count="32" HexcodeCount="0" Name="stream"/>
        <Keyword Count="32" HexcodeCount="0" Name="endstream"/>
        <Keyword Count="1" HexcodeCount="0" Name="xref"/>
        <Keyword Count="1" HexcodeCount="0" Name="trailer"/>
        <Keyword Count="1" HexcodeCount="0" Name="startxref"/>
        <Keyword Count="8" HexcodeCount="0" Name="/Page"/>
        <Keyword Count="0" HexcodeCount="0" Name="/Encrypt"/>
        <Keyword Count="0" HexcodeCount="0" Name="/ObjStm"/>
        <Keyword Count="0" HexcodeCount="0" Name="/JS"/>
        <Keyword Count="0" HexcodeCount="0" Name="/JavaScript"/>
        <Keyword Count="0" HexcodeCount="0" Name="/AA"/>
        <Keyword Count="0" HexcodeCount="0" Name="/OpenAction"/>
        <Keyword Count="0" HexcodeCount="0" Name="/AcroForm"/>
        <Keyword Count="0" HexcodeCount="0" Name="/JBIG2Decode"/>
        <Keyword Count="0" HexcodeCount="0" Name="/RichMedia"/>
        <Keyword Count="0" HexcodeCount="0" Name="/Launch"/>
        <Keyword Count="0" HexcodeCount="0" Name="/EmbeddedFile"/>
        <Keyword Count="0" HexcodeCount="0" Name="/XFA"/>
        <Keyword Count="0" HexcodeCount="0" Name="/Colors &gt; 2^24"/>
    </Keywords>
    

    and set the count to 0 for anything it considers suspicious.
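
To act on a keyword report like the one above, you could parse it and reject any upload whose risky keywords have a non-zero count. The following is a rough sketch that assumes the report is already available as an XML string in the shape shown. The set `RISKY` and the function `risky_keywords` are hypothetical names, and how you obtain the string from PDFiD is not covered here.

```python
import xml.etree.ElementTree as ET

# Assumed set of keywords that should not appear in a static document.
RISKY = {
    "/JS", "/JavaScript", "/AA", "/OpenAction",
    "/Launch", "/EmbeddedFile", "/RichMedia", "/XFA",
}


def risky_keywords(report_xml: str) -> dict:
    """Return {keyword name: count} for risky keywords with a count above 0."""
    root = ET.fromstring(report_xml)
    found = {}
    # iter() also yields matching descendants, so this works whether
    # <Keywords> is the root element or nested inside a larger report.
    for keyword in root.iter("Keyword"):
        name = keyword.get("Name")
        count = int(keyword.get("Count", "0"))
        if name in RISKY and count > 0:
            found[name] = count
    return found
```

An empty result means none of the listed keywords were reported, so the upload can continue. Anything else can be logged and the file rejected.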