Correct format of ICC-based /ColorSpace in PDF


I am generating PDF files on-the-fly. The files contain JPEG images in the Adobe RGB (1998) colourspace, with the profile embedded. The PDF generation toolkit embeds the images correctly, but sets the /ColorSpace entry of each image's stream dictionary to /DeviceRGB, with no way of changing it. When the PDF is printed, the image colours are incorrect, probably because they are not being interpreted in the Adobe RGB colourspace.

Example of the original PDF object structure:

obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace /DeviceRGB
/Length 174780
>>
stream (jpeg data)
endstream
endobj

Therefore I am trying to alter the PDF after the fact, to change the /ColorSpace key to use the Adobe RGB ICC profile. Using the code below, the object structure becomes as follows, which looks correct against other PDFs I have seen, but results in a corrupted PDF. Where have I gone wrong?

obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace [ /ICCBased <<
/N 3
/Filter /FlateDecode
/Length 284
>> stream (icc data)
endstream ]
/Length 174780
>>
stream (jpeg data)
endstream
endobj

This is the pypdf code which loads original.pdf, locates every image, replaces its /ColorSpace entry (/DeviceRGB) with an /ICCBased array, and writes the result out to edited.pdf.

from pathlib import Path
from pypdf import PdfWriter
from pypdf.generic import NameObject, ArrayObject, StreamObject

writer = PdfWriter(clone_from="original.pdf")

icc_stream = StreamObject()
icc_stream.set_data(Path("AdobeRGB1998.icc").read_bytes())
colorspace = ArrayObject([
    NameObject("/ICCBased"),
    icc_stream.flate_encode()
])

for page in writer.pages:
    for image in page.images:
        image.indirect_reference.get_object()[NameObject("/ColorSpace")] = colorspace

with open("edited.pdf", "wb") as fp:
    writer.write(fp)

Solution

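  • The problem was a rookie error in PDF syntax: a stream must always be an indirect object, so it cannot be written inline inside another object's dictionary. I was embedding the ICC profile stream directly within the image object's dictionary: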

    10 0 obj
    <<
    /Type /XObject
    /Subtype /Image
    /Width 2000
    /Height 2000
    /BitsPerComponent 8
    /Interpolate true
    /Filter /DCTDecode
    /ColorSpace [ /ICCBased 
      <<
      /N 3
      /Filter /FlateDecode
      /Length 284
      >> stream (icc data)
      endstream
    ]
    /Length 174780
    >>
    stream (jpeg data)
    endstream
    endobj
    

    when I should have been using an indirect reference to the ICC data instead:

    10 0 obj
    <<
    /Type /XObject
    /Subtype /Image
    /Width 2000
    /Height 2000
    /BitsPerComponent 8
    /Interpolate true
    /Filter /DCTDecode
    /ColorSpace [ /ICCBased 20 0 R ]
    /Length 174780
    >>
    stream (jpeg data)
    endstream
    endobj
    
    20 0 obj
    <<
    /N 3
    /Filter /FlateDecode
    /Length 284
    >>
    stream (icc data)
    endstream
    endobj
    

    The corrected code from above would be:

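    from pathlib import Path
    from pypdf import PdfWriter
    from pypdf.generic import ArrayObject, NameObject, NumberObject, StreamObject
    
    writer = PdfWriter(clone_from="original.pdf")
    
    # Build the ICC profile stream; /N is the number of colour components (3 for RGB).
    icc_stream = StreamObject()
    icc_stream.set_data(Path("AdobeRGB1998.icc").read_bytes())
    icc_stream[NameObject("/N")] = NumberObject(3)
    # Register the compressed stream as its own indirect object and keep the reference.
    # Note: _add_object is underscore-prefixed, i.e. a private pypdf method.
    icc_ref = writer._add_object(icc_stream.flate_encode())
    
    for page in writer.pages:
        for image in page.images:
            obj = image.indirect_reference.get_object()
            # Point the image at the shared profile object rather than embedding the stream.
            obj[NameObject("/ColorSpace")] = ArrayObject(
                [NameObject("/ICCBased"), icc_ref]
            )
    
    with open("edited.pdf", "wb") as fp:
        writer.write(fp)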
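
To make the corrected object layout above a little more concrete, here is a minimal standard-library sketch of what the two pieces look like when serialised by hand. It is not how pypdf writes them internally. The function names are hypothetical, the generation number is assumed to be 0, and the /N value is read from the ICC profile header on the assumption that the profile is a well-formed Gray, RGB or CMYK profile (data colour space signature at bytes 16-19, "acsp" signature at bytes 36-39).

```python
import zlib

# Maps the ICC header's data colour space signature to the /N value
# (number of colour components) declared on an ICCBased stream.
_COMPONENTS = {b"GRAY": 1, b"RGB ": 3, b"CMYK": 4}


def icc_component_count(profile: bytes) -> int:
    """Return /N for an ICC profile by inspecting its 128-byte header."""
    if len(profile) < 128 or profile[36:40] != b"acsp":
        raise ValueError("not an ICC profile")
    signature = profile[16:20]  # data colour space field
    if signature not in _COMPONENTS:
        raise ValueError(f"unsupported colour space {signature!r}")
    return _COMPONENTS[signature]


def icc_indirect_object(obj_num: int, profile: bytes) -> bytes:
    """Serialise the profile as a standalone, Flate-compressed stream object."""
    data = zlib.compress(profile)
    header = (
        f"{obj_num} 0 obj\n"
        f"<<\n/N {icc_component_count(profile)}\n"
        f"/Filter /FlateDecode\n/Length {len(data)}\n>>\n"
        "stream\n"
    ).encode("ascii")
    # /Length counts the compressed bytes only, not the stream keywords.
    return header + data + b"\nendstream\nendobj\n"


def icc_colorspace_entry(obj_num: int) -> bytes:
    """Return the /ColorSpace entry that refers to the profile object."""
    return f"/ColorSpace [ /ICCBased {obj_num} 0 R ]".encode("ascii")
```

The image dictionary only ever contains the short entry returned by icc_colorspace_entry, while the profile bytes live in their own numbered object, so several images can share one profile.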