Tags: pdf, pdfbox, color-space, pdfclown

Unable to extract CMYK colorspaces from PDF


I'm trying to extract colorspace data from a PDF. I have a file with Pantone and CMYK colorspaces. When I extract the colorspaces from the PDF using a PDF library (I tried pdfclown, pdfbox and icePdf), the output contains only the Pantone colorspaces and no information at all about the CMYK colorspace. I examined the file in CorelDraw; when I click on a colorspace there, it shows the exact colorspace value (e.g. PANTONE 3735 C, C 0 M 50 Y 50 K 0, etc.). How can I extract all the colorspaces present in a PDF (Pantone and CMYK)?

using (var file = new org.pdfclown.files.File(filePath))
{
    org.pdfclown.documents.Document document = file.Document;

    foreach (org.pdfclown.documents.Page page in document.Pages)
    {
        // Wrap the page contents into the scanner.
        ContentScanner cs = new ContentScanner(page);

        // Requires "using System.Linq;" for ToList().
        System.Collections.Generic.List<org.pdfclown.documents.contents.colorSpaces.ColorSpace> list =
            cs.Contents.ContentContext.Resources.ColorSpaces.Values.ToList();
        for (int i = 0; i < list.Count; i++)
        {
            // Print the colorspaces declared in the page resources.
            Console.WriteLine(list[i]);
        }
    }
}

Sample PDF Document having CMYK and PANTONE Colors

Output from 'pdfclown' showing PANTONE and its alternative colorspaces:

screen shot


Solution

  • Original answer

    Unfortunately you don't show your code, but your screenshot suggests that you only look at the ColorSpace section of the page Resources. That is not sufficient, for several reasons:

    • First of all, the colorspace resources are referenced by name from the content streams (cf. the Contents entry on your screen shot) to select colorspaces for stroking or filling. But there are some predefined names that do not need to be described in the resources, cf. the documentation of the CS operator:

      Set the current colour space to use for stroking operations. The operand name shall be a name object. If the colour space is one that can be specified by a name and no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain cases of Pattern), the name may be specified directly. Otherwise, it shall be a name defined in the ColorSpace subdictionary of the current resource dictionary.

      (ISO 32000-1, Table 74 – Colour Operators)

      Thus, to check whether DeviceGray, DeviceRGB, or DeviceCMYK are used, you have to scan the content stream for color space selection operations (CS or cs) using these names.

      Furthermore, there are shortcut color selection operations which set one of those colorspaces and immediately select a color in it (g, G, rg, RG, k, K). You have to scan the content stream for these, too.

      E.g. in your page content stream you can find:

      0.3 0 1 0 k
      

      and

      0.9 g
      

      and multiple other occurrences of these operators. Thus, at least DeviceGray and DeviceCMYK are in use (in addition to the resources you found).

    • Furthermore, not all of the colorspaces you find in the ColorSpace resource dictionary are necessarily used in the content. Thus, while scanning the content as above for uses of undeclared colorspaces, you also have to scan for the declared colorspaces to make sure that they are actually used.

    • You also have to look at other resources used from your content streams:

      • The bitmap images (XObjects with Subtype value Image), e.g. Im1 has ColorSpace DeviceCMYK and Im5 has ColorSpace DeviceRGB.

        Again you have to make sure that the bitmaps actually are used in your content stream.

        Beware, JPEG2000 bitmaps may bring along their own colorspace definition in their own format!

      • Shadings: all shadings in your PDF have ColorSpace DeviceCMYK. Again, make sure they're actually used.

      • Form XObjects and Patterns have content streams and resources of their own. Don't forget deep-searching into their structure. In your case, though, there are none.

      • Type 3 font glyphs are defined via content streams and resources, so they may also use colorspaces of their own. None are used in your file.

      • Transparency groups also may have a colorspace setting specifying among other things the colour space of the group as a whole when it in turn is painted as an object onto its backdrop.

    • ...

    Maybe I forgot 1 or 20 other places to look for relevant colorspace settings...

    For your file, though, the places mentioned above already show that, in addition to your ColorSpace resources, DeviceGray, DeviceRGB, and DeviceCMYK are also used in your PDF.
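To make the operator scan described above concrete, here is a minimal, library-independent sketch in C#. It assumes you already have the decoded text of a content stream as a string, and it tokenizes naively on whitespace, so it will be fooled by operator-like text inside string literals, comments, or inline images, and by operators that directly abut a delimiter. The names ColorOperatorSketch and FindColorSpaces are made up for illustration; they are not part of PDF Clown or PDF Box.

```csharp
using System;
using System.Collections.Generic;

static class ColorOperatorSketch
{
    // Shortcut operators that implicitly select a device colorspace
    // (ISO 32000-1, Table 74).
    static readonly Dictionary<string, string> Shortcuts = new Dictionary<string, string>
    {
        { "g", "DeviceGray" }, { "G", "DeviceGray" },
        { "rg", "DeviceRGB" }, { "RG", "DeviceRGB" },
        { "k", "DeviceCMYK" }, { "K", "DeviceCMYK" },
    };

    // Returns the colorspace names selected in already-decoded content stream text.
    // Hypothetical helper: naive whitespace tokenization, not a full PDF parser.
    public static SortedSet<string> FindColorSpaces(string content)
    {
        var found = new SortedSet<string>(StringComparer.Ordinal);
        string[] tokens = content.Split(
            new[] { ' ', '\t', '\r', '\n', '\f' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < tokens.Length; i++)
        {
            string token = tokens[i];
            if (Shortcuts.TryGetValue(token, out string space))
            {
                found.Add(space);
            }
            else if ((token == "cs" || token == "CS") && i > 0 && tokens[i - 1].StartsWith("/"))
            {
                // The operand is a name: a device colorspace or a ColorSpace resource key.
                found.Add(tokens[i - 1].Substring(1));
            }
        }
        return found;
    }

    static void Main()
    {
        string sample = "0.3 0 1 0 k\n0.9 g\n/CS0 cs 1 scn\n";
        // prints: CS0, DeviceCMYK, DeviceGray
        Console.WriteLine(string.Join(", ", FindColorSpaces(sample)));
    }
}
```

A name such as CS0 reported here is only a resource key; it still has to be looked up in the ColorSpace subdictionary of the current resources to find out which colorspace it denotes.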

    On the comments

    As you meanwhile have provided code and this code uses PDF Clown, I'll use it here, too. You can do equivalent stuff with PDF Box.

    Scan through a content stream

    A. How to scan through a ContentStream? (I checked the BaseDataObject of the 'Contents'; it looks like this: ' [0] {cm [1, 0, 0, 1, 0, 0]}, 1 {gs [GS11]}'.)

    With PDF Clown you usually scan through a content stream using a ContentScanner, and in your code you already have a ContentScanner cs. Thus, simply call ScanForColorspaceUsage(cs) in your loop, with ScanForColorspaceUsage defined like this:

    void ScanForColorspaceUsage(ContentScanner cs)
    {
        while (cs.MoveNext())
        {
            ContentObject content = cs.Current;
            if (content is CompositeObject)
            {
                ScanForColorspaceUsage(cs.ChildLevel);
            }
            else if (content is SetFillColorSpace _cs)
            {
                Console.WriteLine("Used as fill color space: {0}", _cs.Name);
            }
            else if (content is SetDeviceCMYKFillColor _k)
            {
                Console.WriteLine("Used as fill color space: DeviceCMYK");
            }
            else if (content is SetDeviceGrayFillColor _g)
            {
                Console.WriteLine("Used as fill color space: DeviceGray");
            }
            else if (content is SetDeviceRGBFillColor _rg)
            {
                Console.WriteLine("Used as fill color space: DeviceRGB");
            }
            else if (content is SetStrokeColorSpace _CS)
            {
                Console.WriteLine("Used as stroke color space: {0}", _CS.Name);
            }
            else if (content is SetDeviceCMYKStrokeColor _K)
            {
                Console.WriteLine("Used as stroke color space: DeviceCMYK");
            }
            else if (content is SetDeviceGrayStrokeColor _G)
            {
                Console.WriteLine("Used as stroke color space: DeviceGray");
            }
            else if (content is SetDeviceRGBStrokeColor _RG)
            {
                Console.WriteLine("Used as stroke color space: DeviceRGB");
            }
        }
    }
    

    All colorspaces

    B. Whether a colorspace is used or not, I want to display all the colorspaces available in the PDF. When I checked the above document in CorelDraw, it displayed around 30-35 colorspaces as CMYK (in the second row of the horizontal array of colorspaces).

    Going through your document, whenever CMYK color is used, it is used via the DeviceCMYK color space, no special ICCBased one. Thus, only one CMYK colorspace is used in your PDF.

    I don't have CorelDraw, so I cannot tell what exactly it shows you. Or do you mean individual CMYK colors?
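If individual CMYK colors are what is meant, one way to list them is to collect the distinct operand tuples of the k and K operators. The following is only a sketch under that assumption; whether it reproduces what CorelDraw displays is a guess. It works on decoded content stream text, tokenizes naively on whitespace, and ignores CMYK colors coming from images or shadings. The names CmykColorSketch and FindCmykColors are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CmykColorSketch
{
    // Collects the distinct DeviceCMYK colors set via the k / K operators,
    // in order of first appearance, formatted as percentages.
    public static List<string> FindCmykColors(string content)
    {
        var seen = new HashSet<string>();
        var colors = new List<string>();
        string[] tokens = content.Split(
            new[] { ' ', '\t', '\r', '\n', '\f' }, StringSplitOptions.RemoveEmptyEntries);

        // k and K take four numeric operands, so start at index 4.
        for (int i = 4; i < tokens.Length; i++)
        {
            if (tokens[i] != "k" && tokens[i] != "K") continue;

            var parts = new string[4];
            bool ok = true;
            for (int j = 0; j < 4 && ok; j++)
            {
                ok = double.TryParse(tokens[i - 4 + j], NumberStyles.Float,
                                     CultureInfo.InvariantCulture, out double v);
                // Operands are fractions in 0..1; report them as whole percentages.
                parts[j] = Math.Round(v * 100).ToString(CultureInfo.InvariantCulture);
            }
            if (!ok) continue; // Not four numbers in front of the operator: skip.

            string color = string.Format("C {0} M {1} Y {2} K {3}",
                                         parts[0], parts[1], parts[2], parts[3]);
            if (seen.Add(color)) colors.Add(color);
        }
        return colors;
    }

    static void Main()
    {
        string sample = "0 0.5 0.5 0 k\n0.3 0 1 0 K\n0 0.5 0.5 0 k\n";
        // prints:
        // C 0 M 50 Y 50 K 0
        // C 30 M 0 Y 100 K 0
        foreach (string color in FindCmykColors(sample))
            Console.WriteLine(color);
    }
}
```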

    Learn deeper

    C. Where can I learn more about these things to understand them better?

    If by these things you mean how this all is represented in PDFs, the PDF specification might be a good reference. The most current one, ISO 32000-2, is only available for money, e.g. from the ISO store, but the older one, ISO 32000-1, is also shared by Adobe for download as PDF32000_2008.pdf.