
decoding PDF: can I use PIL/Pillow to access the PNG predictor algorithm in order to reverse it for ingest to PIL?


I'm decoding PDF files using Python with reference to the 2008 spec: https://web.archive.org/web/20081203002256/https://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf particularly section 7.4.4.4.

Images are usually embedded in PDF as byte streams, and each stream is associated with a dictionary with information about the stream. For example, often the stream is a compressed form of the original data; such details are described by the Filter entry in the dictionary.

When I've got a stream whose filter is FlateDecode, this means the data were compressed using deflate, and this is easily reversed with zlib.decompress. But... to improve compression the original data may be preprocessed by a filter, for example to difference adjacent bytes - when the data have a lot of similar values the result then compresses better. The preprocessing is identified by the Predictor entry in the dictionary.
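To make the two steps concrete, here is a minimal sketch of undoing FlateDecode and of why differencing helps. The helper names `inflate` and `sub_filter` and the ramp test data are made up for illustration; only `zlib.compress` and `zlib.decompress` are real library calls.

```python
import zlib


def inflate(stream_body: bytes) -> bytes:
    """Undo FlateDecode: the stream body is zlib-wrapped deflate data."""
    return zlib.decompress(stream_body)


def sub_filter(row: bytes) -> bytes:
    """Illustrative differencing: each byte minus the previous byte, mod 256."""
    return bytes(
        (row[i] - (row[i - 1] if i else 0)) & 0xFF for i in range(len(row))
    )


# A smooth ramp stands in for image data with many similar neighbouring values.
ramp = bytes(range(256)) * 4

plain_size = len(zlib.compress(ramp))
diff_size = len(zlib.compress(sub_filter(ramp)))

# The round trip recovers the data, and the differenced ramp compresses smaller.
assert inflate(zlib.compress(ramp)) == ramp
assert diff_size < plain_size
```

The differenced ramp is almost entirely the value 1, which deflate encodes far more compactly than 256 distinct byte values.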

The Predictor value 15 means to use a PNG differencing algorithm; unfortunately, all the 2008 PDF document basically says about it is "PNG prediction (on encoding, PNG optimum)". Yay.

Can someone (a) explain to me exactly which PNG filter algorithm this means (with a reference to its specification) and (b) ideally point me at a library which will reverse it? Lacking the latter I'd have to reverse it in pure Python, which will be slow. That is acceptably slow for my initial use case, and I guess I can write it as a C extension (much) later if my needs become more frequent.

Where I am at present is:

  • I've got the uncompressed stream data as a bytes object, which is raw pixel data
  • I know the Predictor value, 15 in my present example document
  • if I don't reverse the predictor algorithm and decode the data as is, I get an image which (a) looks like edges, because the data express differences instead of direct colour components, and (b) is skewed, because the rows have a leading indicator of some kind and I'm decoding that tag as pixel data, making each row a bit too long. The PDF spec says: "The postprediction data for each PNG-predicted row shall begin with an explicit algorithm tag; therefore, different rows can be predicted with different algorithms to improve compression. TIFF Predictor 2 has no such identifier; the same algorithm applies to all rows."
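The skew can be seen by working out the row layout. The sketch below assumes that every predicted row is one tag byte followed by the packed sample bytes, and that the row width comes from the `Columns` entry of the decode parameters. The function name `split_png_rows` is hypothetical.

```python
def split_png_rows(data: bytes, columns: int, colors: int = 1,
                   bits_per_component: int = 8):
    """Split PNG-predicted data into (tag, row_bytes) pairs.

    Each row occupies one tag byte plus ceil(columns * colors * bits / 8)
    bytes of (still predicted) sample data.
    """
    row_len = (columns * colors * bits_per_component + 7) // 8
    stride = row_len + 1  # the extra byte is the per-row algorithm tag
    if len(data) % stride:
        raise ValueError("data length is not a whole number of rows")
    return [
        (data[start], data[start + 1:start + stride])
        for start in range(0, len(data), stride)
    ]


# Two rows of three 8-bit grey samples; the tags are 0 and 2.
rows = split_png_rows(bytes([0, 1, 2, 3, 2, 4, 5, 6]), columns=3)
assert rows == [(0, b"\x01\x02\x03"), (2, b"\x04\x05\x06")]
```

Feeding the data to an image decoder without removing the tags makes every row one byte too long, which is consistent with the skew described above.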

Currently my image property method looks like this:

  @property
  def image(self):
    im = self._image
    if im is None:
      decoded_bs = self.decoded_payload
      print(".image: context_dict:")
      print(decoded_bs[:10])
      pprint(self.context_dict)
      decode_params = self.context_dict.get(b'DecodeParms', {})
      color_transform = decode_params.get(b'ColorTransform', 0)
      color_space = self.context_dict[b'ColorSpace']
      bits_per_component = decode_params.get(b'BitsPerComponent')
      if not bits_per_component:
        bits_per_component = {b'DeviceRGB': 8, b'DeviceGray': 8}[color_space]
      colors = decode_params.get(b'Colors')
      if not colors:
        colors = {b'DeviceRGB': 3, b'DeviceGray': 1}[color_space]
      mode_index = (color_space, bits_per_component, colors, color_transform)
      width = self.context_dict[b'Width']
      height = self.context_dict[b'Height']
      print("mode_index =", mode_index)
      PIL_mode = {
          (b'DeviceGray', 1, 1, 0): '1',  # 1 bit per pixel, rows packed into bytes
          (b'DeviceGray', 8, 1, 0): 'L',
          (b'DeviceRGB', 8, 3, 0): 'RGB',
      }[mode_index]
      print(
          "Image.frombytes(%r,(%d,%d),%r)..." %
          (PIL_mode, width, height, decoded_bs[:32])
      )
      im = Image.frombytes(PIL_mode, (width, height), decoded_bs)
      im.show()
      exit(1)
      self._image = im
    return im

This shows me the "edgy" and skewed image because I'm decoding difference data as colour data and decoding the row tags as pixel data, skewing subsequent rows slightly.


Solution

  • The predictor used for each row is given by the first byte in each row, if the "Predictor" parameter is 10 or more. In that case, the value of that parameter has no further meaning. It doesn't matter that it's 15, other than the fact that 15 >= 10.

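    You can find the filter types in the PNG specification. The original answer illustrated them with three figures:

    [Figure: the neighbouring bytes a, b, c, and the current byte x]

    [Figure: table of PNG filter types]

    [Figure: the Paeth predictor]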
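
For reference, the PNG specification defines five filter types, selected by the tag byte: 0 (None), 1 (Sub), 2 (Up), 3 (Average), and 4 (Paeth). In its notation, x is the byte being reconstructed, a is the corresponding byte of the pixel to the left, b is the byte directly above in the previous row, and c is the byte above and to the left. What follows is a pure-Python sketch of reversing these filters, not a tuned implementation. The function names are made up for illustration, and the parameters are assumed to come from the `Columns`, `Colors`, and `BitsPerComponent` entries of the decode parameters.

```python
def paeth(a: int, b: int, c: int) -> int:
    """Paeth predictor: whichever of a, b, c is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_predictor(data: bytes, columns: int, colors: int = 1,
                       bits_per_component: int = 8) -> bytes:
    """Reverse per-row PNG filtering and drop the tag bytes."""
    # Bytes per complete pixel, rounded up to at least 1.
    bpp = (colors * bits_per_component + 7) // 8
    row_len = (columns * colors * bits_per_component + 7) // 8
    stride = row_len + 1  # one tag byte precedes every row
    out = bytearray()
    prev = bytes(row_len)  # the row above the first row counts as all zeros
    for start in range(0, len(data), stride):
        tag = data[start]
        row = bytearray(data[start + 1:start + stride])
        if len(row) != row_len:
            raise ValueError("truncated row")
        for i in range(row_len):
            a = row[i - bpp] if i >= bpp else 0   # left, already reconstructed
            b = prev[i]                           # above
            c = prev[i - bpp] if i >= bpp else 0  # above left
            if tag == 0:
                pred = 0
            elif tag == 1:
                pred = a
            elif tag == 2:
                pred = b
            elif tag == 3:
                pred = (a + b) // 2
            elif tag == 4:
                pred = paeth(a, b, c)
            else:
                raise ValueError("unknown PNG filter tag %d" % tag)
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)


# Row 1 is stored unfiltered (tag 0); row 2 uses Up (tag 2).
assert undo_png_predictor(bytes([0, 10, 20, 30, 2, 1, 1, 1]), columns=3) == \
    bytes([10, 20, 30, 11, 21, 31])
```

The returned bytes hold plain sample data with no tag bytes, so they should be usable with a call such as `Image.frombytes(mode, (width, height), pixels)` in place of the raw decompressed stream.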