I have a component that converts PDF documents to images, one image per page. Since the component uses converters producing in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.
I'm trying to improve the overall performance of the conversion process, and found a native library with a JNI binding to convert PDFs to TIFFs. That library can convert PDFs to single TIFF files only (requires intermediate file system storage; does not even consume conversion streams), therefore result TIFF files have converted pages embedded, and not per-page images on the file system. Having a native library improves the overall conversion drastically and the performance gets really faster, but there is a real bottleneck: since I have to make a source-page to destination-page conversion, now I must extract every page from the result file and write all of them elsewhere. A simple and naive approach with RenderedImage
s:
final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
// V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...
Actually speaking, I would really like just to extract a concrete page input stream from the TIFF container file (tempFile
) and just redirect it to elsewhere without having it to be stored as an in-memory image. I would imagine an approach similar to containers processing where I need to seek for a specific entry to extract data from it (say, something like ZIP files processing, etc). But I couldn't find anything like that in ImageDecoder
, or I'm probably wrong with my expectations and just missing something important here...
Is it possible to extract TIFF container page input streams using JAI API or probably third-party alternatives? Thanks in advance.
I could be wrong, but don't think JAI has support for splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs is contributed by a third party).
By using the TIFFUtilities
class from com.twelvemonkeys.contrib.tiff
, you should be able to split your multi-page TIFF to multiple single-page TIFFs like this:
TIFFUtilities.split(tempFile, new File("output"));
No decoding of the images are done, only splitting each IFD into a separate file, and writing the streams with corrected offsets and byte counts.
Files will be named output/0001.tif
, output/0002.tif
etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.