pdf rust encoding png decoding

How to decode a PNG image in a PDF file?


I am building a tool to extract images from a PDF file using Rust. So far I can extract every kind of image except PNG.

I used a crate called pdf to extract the images.

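    // Collect every image XObject referenced from any page's resources.
    let mut images: Vec<_> = vec![];

    for page in file.pages() {
        let page = page.unwrap();

        let resources = page.resources()?;

        images.extend(
            resources
                .xobjects
                .iter()
                // Resolve each XObject reference to the actual object.
                .map(|(_name, &r)| file.get(r).unwrap())
                // Keep only images; skip forms and other XObject kinds.
                .filter(|o| matches!(**o, pdf::object::XObject::Image(_))),
        )
    }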

Then, depending on the stream filter, I choose the output format:

for (i, o) in images.iter().enumerate() {
        let img = match **o {
            XObject::Image(ref im) => im,
            _ => continue,
        };

        let (data, filter) = img.raw_image_data(&file)?;

        use StreamFilter::*;

        let ext = match filter {
            Some(DCTDecode(_)) => "jpeg",
            Some(JBIG2Decode) => "jbig2",
            Some(JPXDecode) => "jp2k",
            _ => {
                log::debug!("main : unsupported image format");
                continue;
            }
        };

The problem arises when the PDF file contains PNG images: for those, the filter is always None. So I experimented a little and tried to decode the data as a PNG with the image crate.

for (i, o) in images.iter().enumerate() {
        let img = match **o {
            XObject::Image(ref im) => im,
            _ => continue,
        };

        let (data, filter) = img.raw_image_data(&file)?;

        use StreamFilter::*;

        let ext = match filter {
            Some(DCTDecode(_)) => "jpeg",
            Some(JBIG2Decode) => "jbig2",
            Some(JPXDecode) => "jp2k",
            _ => {
                let img = image::io::Reader::new(Cursor::new(data.clone()))
                    .with_guessed_format()?
                    .decode()?;
                "png"
            }
        };

But when I ran the code I got

Error: Unsupported(UnsupportedError { format: Unknown, kind: Format(Unknown) })

I then read the PDF 1.7 reference document and used the flate2 library to decompress the data, but I got

Error : Invalid Signature

How should I proceed from here?


Solution

  • PNG images are not included in a PDF file directly (unlike JPEG and JPEG2000, for example), so you cannot extract them directly.

    You also cannot extract JBIG2Decode data directly - it is more complicated than you suppose.

    For PNG, the data in the FlateDecode stream is just the payload of the PNG's IDAT chunk. You must reconstruct the IHDR chunk from, for example, /Width, /Height, /BitsPerComponent and /ColorSpace, and then write the PNG file yourself, calculating checksums and building chunks according to the PNG standard.
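
The steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the pdf crate's API: `wrap_idat_as_png`, `push_chunk` and `crc32` are hypothetical helper names. It assumes `idat_payload` is the still-compressed FlateDecode stream and that the stream was written with a PNG predictor (so every row already carries its filter byte). If the data you hold has already been inflated, it has to be recompressed first. The caller is also assumed to have mapped /ColorSpace to a PNG color type (0 for DeviceGray, 2 for DeviceRGB); indexed images would additionally need a PLTE chunk, which is not shown.

```rust
/// CRC-32 (reflected polynomial 0xEDB88320), the checksum PNG uses for chunks.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Appends one chunk: big-endian length, 4-byte type, data, CRC over type + data.
fn push_chunk(out: &mut Vec<u8>, kind: &[u8; 4], data: &[u8]) {
    out.extend_from_slice(&(data.len() as u32).to_be_bytes());
    let start = out.len();
    out.extend_from_slice(kind);
    out.extend_from_slice(data);
    let crc = crc32(&out[start..]);
    out.extend_from_slice(&crc.to_be_bytes());
}

/// Hypothetical helper: wraps an existing zlib stream in a minimal PNG container.
/// Assumption: `idat_payload` is valid IDAT content for the given dimensions.
fn wrap_idat_as_png(
    width: u32,
    height: u32,
    bit_depth: u8,  // from /BitsPerComponent
    color_type: u8, // derived from /ColorSpace by the caller
    idat_payload: &[u8],
) -> Vec<u8> {
    // Fixed 8-byte PNG signature.
    let mut png = vec![0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

    // IHDR: width, height, bit depth, color type,
    // then compression = 0, filter method = 0, interlace = 0.
    let mut ihdr = Vec::with_capacity(13);
    ihdr.extend_from_slice(&width.to_be_bytes());
    ihdr.extend_from_slice(&height.to_be_bytes());
    ihdr.extend_from_slice(&[bit_depth, color_type, 0, 0, 0]);

    push_chunk(&mut png, b"IHDR", &ihdr);
    push_chunk(&mut png, b"IDAT", idat_payload);
    push_chunk(&mut png, b"IEND", &[]);
    png
}
```

The returned bytes could then be written out with `std::fs::write`. Whether the result opens in a viewer depends entirely on the assumptions above holding for the particular stream.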