Search code examples
javatiffjai

JAI: How do I extract a single page input stream from a multipaged TIFF image container?


I have a component that converts PDF documents to images, one image per page. Since the component uses converters producing in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.

I'm trying to improve the overall performance of the conversion process, and found a native library with a JNI binding to convert PDFs to TIFFs. That library can convert PDFs to single TIFF files only (requires intermediate file system storage; does not even consume conversion streams), therefore result TIFF files have converted pages embedded, and not per-page images on the file system. Having a native library improves the overall conversion drastically and the performance gets really faster, but there is a real bottleneck: since I have to make a source-page to destination-page conversion, now I must extract every page from the result file and write all of them elsewhere. A simple and naive approach with RenderedImages:

final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
//                                               V--- heap is wasted here
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest stuff ...

Actually speaking, I would really like just to extract a concrete page input stream from the TIFF container file (tempFile) and just redirect it to elsewhere without having it to be stored as an in-memory image. I would imagine an approach similar to containers processing where I need to seek for a specific entry to extract data from it (say, something like ZIP files processing, etc). But I couldn't find anything like that in ImageDecoder, or I'm probably wrong with my expectations and just missing something important here...

Is it possible to extract TIFF container page input streams using JAI API or probably third-party alternatives? Thanks in advance.


Solution

  • I could be wrong, but don't think JAI has support for splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs is contributed by a third party).

    By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF to multiple single-page TIFFs like this:

    TIFFUtilities.split(tempFile, new File("output"));
    

    No decoding of the images are done, only splitting each IFD into a separate file, and writing the streams with corrected offsets and byte counts.

    Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.