java pdf itext apache-tika

How to check if a PDF document contains an image


I am reading text from PDF documents using the iText library. However, some PDF documents might have images embedded within them in addition to text.

I'm wondering whether there is any way, through iText or something else, to determine whether a PDF document contains an image?


Solution

  • You can do a correct and 100% reliable check using a PDF library.

    However, you can probably do a fairly reliable check just by reading the PDF as text and processing it that way. First check that it is a PDF by looking for the PDF header at the start,

    %PDF...
    

    Then scan through looking for the phrase,

    /XObject
    

    When you hit this tag you need to check backwards and forwards in the stream to the << and >> dictionary boundaries to pull out the full XObject dictionary. There may be nested << and >> so you might want to check back to the 'obj' and forwards to the 'stream' entry. Anyhow you'll end up with something that looks like this,

    << 
    /Type /XObject /Subtype /Image /Name /I1 
    /Width 800 /Height 128 
    /BitsPerComponent 1 /ImageMask true 
    /Filter [/FlateDecode] 
    /Length 2302 >> 
    

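    The thing you need to check here is that the dictionary has a /Subtype entry whose value is /Image. The two names are usually separated by some whitespace, but PDF syntax also allows them to run together as /Subtype/Image, so allow for both. If you hit that then you have an image.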
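
    A minimal Java sketch of the scan described so far, using only the standard library. The class and method names (PdfImageHeuristic, containsImageXObject) are made up for illustration and are not part of iText or any other library. It assumes the whole file fits in memory, and it is only the heuristic, not a parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Heuristic only: treats the raw PDF bytes as text and does not parse the file. */
final class PdfImageHeuristic {

    // "/Subtype", optional whitespace, "/Image", and no further name characters after it.
    private static final Pattern IMAGE_SUBTYPE =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    static boolean containsImageXObject(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF")) {
            throw new IllegalArgumentException("No %PDF header at the start of the data");
        }
        int from = 0;
        int hit;
        while ((hit = text.indexOf("/XObject", from)) >= 0) {
            // Widen the window back to the nearest "obj" and forward to "stream" or "endobj".
            int start = Math.max(0, text.lastIndexOf("obj", hit));
            int end = windowEnd(text, hit);
            if (IMAGE_SUBTYPE.matcher(text.substring(start, end)).find()) {
                return true;
            }
            from = hit + "/XObject".length();
        }
        return false;
    }

    // Index of the first "stream" or "endobj" at or after 'from', or the end of the text.
    private static int windowEnd(String text, int from) {
        int end = text.length();
        int stream = text.indexOf("stream", from);
        int endobj = text.indexOf("endobj", from);
        if (stream >= 0) end = Math.min(end, stream);
        if (endobj >= 0) end = Math.min(end, endobj);
        return end;
    }
}
```

    Calling it would look something like PdfImageHeuristic.containsImageXObject(Files.readAllBytes(Paths.get("input.pdf"))), where input.pdf is a placeholder path.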

    So what are the limits of this approach?

    Well it is possible to embed an image in the document but not use it. That would result in a false positive. I think this is pretty unlikely though. It would be very inefficient to do so and only a really skanky producer would do it.

    Images can be embedded directly in page content streams (inline images), as mentioned by Hugo above. That would result in a false negative. These are pretty uncommon though. It's one of those bits of the spec which was never a good idea and it's not widely used. If you have documents from a single producer (as is often the case) it will become apparent very quickly whether it does this or not. At a guess, I can't imagine that more than 1% of wild PDFs would contain this construct.
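
    To illustrate that false-negative case, here is a rough sketch of spotting an inline image in a content stream that has already been decoded (content streams are usually compressed, so the raw-text scan will not see inside them). The class name InlineImageHeuristic is hypothetical, and the check is deliberately naive: it only looks for a BI operator followed later by an ID operator, so text inside a string that happens to look like those tokens would fool it.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Naive check for the BI ... ID ... EI inline image construct in decoded content. */
final class InlineImageHeuristic {

    // A standalone "BI" token, then anything, then a standalone "ID" token.
    private static final Pattern INLINE_IMAGE =
            Pattern.compile("(?:^|\\s)BI\\s.*?\\sID\\s", Pattern.DOTALL);

    // Assumption: the caller has already applied the stream's filters (e.g. inflated it).
    static boolean hasInlineImage(byte[] decodedContentStream) {
        String ops = new String(decodedContentStream, StandardCharsets.ISO_8859_1);
        return INLINE_IMAGE.matcher(ops).find();
    }
}
```

    Getting the decoded bytes in the first place is the expensive part, which is why this sits with the full-parse approach rather than the quick scan.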

    It is possible to embed these XObject tags as references rather than direct objects. But I think you can completely discount that. While legal, it would be absolutely bizarre. I don't think you'll ever see that.

    The correct way involves scanning and parsing all the content streams in the PDF. It's what we do in ABCpdf (which I work on) but it is a lot more work and a lot more processing power. It could be many seconds on a large document.

    Think about whether 99% reliability is going to be good enough. :-)