c# pdf itext

How do I extract the data of each stream in a PDF file?


I'm working on a PDF decryption task.

The PDF vendor encrypts the data of each stream with its own method, and it also provides a decrypt function.

I verified that the decrypt function works correctly on the data of one stream. But a PDF file can contain many streams, so I need to extract the data of each stream and feed it to the decrypt function.

Below is one of the streams from the encrypted PDF file:

6 0 obj
<</Length 608/Filter[/VendorPDFEncrypt/FlateDecode]>>stream
data_1
endstream
endobj

The vendor also gave me the decrypted PDF file, so I located the corresponding stream in it, shown below. As you can see, the vendor-added filter is gone and the data has changed.

2 0 obj
<</Filter[/FlateDecode]/Length 598>>stream
data_2
endstream
endobj

Summary of the process:

encrypted PDF file -> extract the data of each stream -> feed each one to the decrypt function -> get a readable PDF file

My question is: how do I extract the data of each stream from the PDF file, so that I can pass each one to the decrypt function?
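
To make the extraction step concrete, here is a minimal sketch in Python that scans the raw bytes of a file for objects shaped like the two shown above. It is an illustration under stated assumptions, not a full PDF parser: it assumes streams are top-level objects with a single dictionary before the `stream` keyword, it ignores the cross-reference table and object streams, and the name `iter_streams` is made up for this example.

```python
import re

# "N G obj", a dictionary, then the "stream" keyword and its end-of-line.
# The dictionary part refuses to run past "endobj", so objects without a
# stream are skipped instead of being glued to the next object.
OBJ_STREAM = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n",
    re.DOTALL,
)
# A direct "/Length 608" entry, but not an indirect one such as "/Length 8 0 R".
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")


def iter_streams(pdf_bytes):
    """Yield (object_number, dictionary_bytes, stream_bytes) per stream object."""
    pos = 0
    while True:
        match = OBJ_STREAM.search(pdf_bytes, pos)
        if match is None:
            return
        obj_num = int(match.group(1))
        dictionary = match.group(3)
        start = match.end()
        length = DIRECT_LENGTH.search(dictionary)
        if length:
            # Trust the declared length when it is written inline.
            end = start + int(length.group(1))
            data = pdf_bytes[start:end]
        else:
            # Fallback: read up to the next "endstream" keyword and drop
            # the end-of-line marker that precedes it.
            end = pdf_bytes.find(b"endstream", start)
            if end == -1:
                return
            data = pdf_bytes[start:end]
            if data.endswith(b"\r\n"):
                data = data[:-2]
            elif data.endswith((b"\n", b"\r")):
                data = data[:-1]
        yield obj_num, dictionary, data
        # Continue after the stream data so binary content is not rescanned.
        pos = end


# Usage (file name and vendor_decrypt are placeholders):
#   with open("in.pdf", "rb") as f:
#       for num, dictionary, data in iter_streams(f.read()):
#           if b"/VendorPDFEncrypt" in dictionary:
#               plain = vendor_decrypt(data)
```

Writing a valid PDF back out additionally requires removing `/VendorPDFEncrypt` from each `/Filter` array, updating `/Length`, and rebuilding the cross-reference table, which is why a PDF library or tool is usually the safer route.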


Solution

  • You can extract the PDF file to JSON, including the compressed stream data, like this:

    cpdf -output-json in.pdf -o out.json

    Then, once you have processed the JSON to decrypt each stream, you can check the result by round-tripping back to PDF:

    cpdf -j new.json -o out.pdf
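
The processing step between the two commands could look roughly like the Python sketch below. It rests on an assumption about the JSON layout that should be checked against the cpdf manual and the actual output of your cpdf version: that the file is an array of `[object number, object]` pairs and that a stream appears as `{"S": [dictionary, data]}`. The helper name `map_streams` and the `transform` callback are hypothetical, and how the stream bytes are encoded inside the JSON string is not handled here.

```python
import json


def map_streams(objects, transform):
    """Apply transform(dictionary, data) to every stream entry, in place.

    transform must return the new (dictionary, data) pair.
    Returns the number of streams that were processed.
    """
    count = 0
    for entry in objects:
        # Assumed layout: each entry is [object_number, object].
        if not (isinstance(entry, list) and len(entry) == 2):
            continue
        obj = entry[1]
        # Assumed layout: a stream object is {"S": [dictionary, data]}.
        if isinstance(obj, dict):
            stream = obj.get("S")
            if isinstance(stream, list) and len(stream) == 2:
                dictionary, data = stream
                obj["S"] = list(transform(dictionary, data))
                count += 1
    return count


# Usage (decrypt_entry is a placeholder for code that calls the vendor's
# decrypt function and removes the vendor filter from the dictionary):
#   with open("out.json") as f:
#       objects = json.load(f)
#   map_streams(objects, decrypt_entry)
#   with open("new.json", "w") as f:
#       json.dump(objects, f)
```

Keeping the traversal separate from the transform means the vendor-specific part stays in one small function, and the result can be verified with the round-trip command above.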