I am building a tool in Rust to extract images from a PDF file. So far I can extract every image format except PNG.
I use the pdf crate to collect the image objects:
    let mut images: Vec<_> = vec![];
    for page in file.pages() {
        let page = page.unwrap();
        let resources = page.resources()?;
        images.extend(
            resources
                .xobjects
                .iter()
                .map(|(_name, &r)| file.get(r).unwrap())
                .filter(|o| matches!(**o, pdf::object::XObject::Image(_))),
        )
    }
Then I choose the output file extension based on the stream filter:
    for (i, o) in images.iter().enumerate() {
        // Skip anything that is not an image XObject.
        let img = match **o {
            XObject::Image(ref im) => im,
            _ => continue,
        };
        let (data, filter) = img.raw_image_data(&file)?;
        use StreamFilter::*;
        let ext = match filter {
            Some(DCTDecode(_)) => "jpeg",
            Some(JBIG2Decode) => "jbig2",
            Some(JPXDecode) => "jp2k",
            _ => {
                log::debug!("main : unsupported image format");
                continue;
            }
        };
        // ... (rest of the loop body omitted: write `data` out using `ext`)
    }
The problem arises when the PDF file contains PNG images: the filter is always None.
So I experimented with the image crate to decode the data as a PNG image:
    // Requires `use std::io::Cursor;`
    for (i, o) in images.iter().enumerate() {
        let img = match **o {
            XObject::Image(ref im) => im,
            _ => continue,
        };
        let (data, filter) = img.raw_image_data(&file)?;
        use StreamFilter::*;
        let ext = match filter {
            Some(DCTDecode(_)) => "jpeg",
            Some(JBIG2Decode) => "jbig2",
            Some(JPXDecode) => "jp2k",
            _ => {
                // Let the image crate guess the format from the leading bytes.
                let _decoded = image::io::Reader::new(Cursor::new(data.clone()))
                    .with_guessed_format()?
                    .decode()?;
                "png"
            }
        };
        // ... (rest of the loop body omitted)
    }
But when I ran the code I got:
Error: Unsupported(UnsupportedError { format: Unknown, kind: Format(Unknown) })
I then read the PDF 1.7 reference document and tried decompressing the data with the flate2 library, but I got:
Error : Invalid Signature
How should I proceed from here?
PNG images are not included in a PDF file directly (unlike JPEG and JPEG2000, for example), so you cannot extract them directly.
You also cannot extract JBIG2Decode streams directly; that is more complicated than you might expect.
For PNG, the data in the FlateDecode stream is just the contents of the PNG's IDAT chunk. You must reconstruct the IHDR chunk from image dictionary entries such as /Width, /Height, /BitsPerComponent, and /ColorSpace, and then write the PNG file yourself, building the chunks and calculating their checksums according to the PNG standard.
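As a rough illustration of "writing the PNG file yourself", here is a minimal sketch that uses only the standard library. It rests on several assumptions that are not from the question: the bytes you hold are already-decoded, unfiltered samples (which may be what raw_image_data returns when the filter is None, but check this for your files), the image has 8 bits per component, and the color space maps to PNG color type 0 (grayscale) or 2 (RGB). The function names are hypothetical. To stay dependency-free it stores the pixel data in uncompressed deflate blocks; in practice you would probably compress with flate2 or use a PNG encoding crate instead.

```rust
/// CRC-32 as used for PNG chunk checksums (bitwise variant, no lookup table).
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

/// Adler-32 checksum that terminates a zlib stream.
fn adler32(bytes: &[u8]) -> u32 {
    let (mut a, mut b) = (1u32, 0u32);
    for &x in bytes {
        a = (a + x as u32) % 65521;
        b = (b + a) % 65521;
    }
    (b << 16) | a
}

/// Appends one chunk: big-endian length, type, payload, CRC over type + payload.
fn push_chunk(out: &mut Vec<u8>, kind: &[u8; 4], payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    let start = out.len();
    out.extend_from_slice(kind);
    out.extend_from_slice(payload);
    let crc = crc32(&out[start..]);
    out.extend_from_slice(&crc.to_be_bytes());
}

/// Wraps `raw` in a zlib stream made of stored (uncompressed) deflate blocks.
fn zlib_stored(raw: &[u8]) -> Vec<u8> {
    let mut z = vec![0x78, 0x01]; // zlib header: deflate, 32 KiB window
    let mut blocks = raw.chunks(0xFFFF).peekable();
    if blocks.peek().is_none() {
        z.extend_from_slice(&[1, 0, 0, 0xFF, 0xFF]); // single empty final block
    }
    while let Some(block) = blocks.next() {
        z.push(blocks.peek().is_none() as u8); // BFINAL bit, BTYPE = 00 (stored)
        let len = block.len() as u16;
        z.extend_from_slice(&len.to_le_bytes());
        z.extend_from_slice(&(!len).to_le_bytes());
        z.extend_from_slice(block);
    }
    z.extend_from_slice(&adler32(raw).to_be_bytes());
    z
}

/// Builds a PNG from unfiltered 8-bit samples (assumption: no PDF predictor applied).
/// `color_type` is the PNG color type: 0 = grayscale, 2 = RGB.
pub fn encode_png(width: u32, height: u32, color_type: u8, pixels: &[u8]) -> Option<Vec<u8>> {
    let channels = match color_type {
        0 => 1,
        2 => 3,
        _ => return None, // other color types are out of scope for this sketch
    };
    let stride = width as usize * channels;
    if width == 0 || height == 0 || pixels.len() != stride * height as usize {
        return None;
    }
    // PNG requires a filter-type byte before every scanline; 0 means "None".
    let mut raw = Vec::with_capacity(pixels.len() + height as usize);
    for row in pixels.chunks(stride) {
        raw.push(0);
        raw.extend_from_slice(row);
    }
    // IHDR: width, height, bit depth 8, color type, then compression,
    // filter method, and interlace method (all 0).
    let mut ihdr = Vec::with_capacity(13);
    ihdr.extend_from_slice(&width.to_be_bytes());
    ihdr.extend_from_slice(&height.to_be_bytes());
    ihdr.extend_from_slice(&[8, color_type, 0, 0, 0]);

    let mut png = vec![0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]; // signature
    push_chunk(&mut png, b"IHDR", &ihdr);
    push_chunk(&mut png, b"IDAT", &zlib_stored(&raw));
    push_chunk(&mut png, b"IEND", &[]);
    Some(png)
}
```

With this in place, something like `std::fs::write("image.png", encode_png(w, h, 2, &data).unwrap())` would write an RGB image, where `w` and `h` come from /Width and /Height. If the stream's /DecodeParms specifies a PNG predictor and the data has not been un-predicted, each row already begins with its filter byte and the `raw.push(0)` step would have to be dropped; that case is not handled here.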