objective-c ios pdf quartz-2d cgpdf

CGPDFScannerScan doesn't fire callback functions


I parse pdf files using Quartz.

Everything works fine except for one file, where the callback functions are not called at all.

My operator table has been created, and I added operators to it with CGPDFOperatorTableSetCallback. Everything seems OK, but the callbacks are never called.

Do you have any idea what could cause this behaviour?


Solution

  • The page content is a single large form XObject, so the operators you registered probably never appear in the page's own content stream. Form XObjects are self-contained graphic objects that have their own content stream, just like a page.
    To scan it, do the following:
    1. Include the 'Do' operator in the list of scanned operators. When it is encountered, its operand is the symbolic name of an XObject.
    2. Get the 'Resources' dictionary from the page dictionary.
    3. From the 'Resources' dictionary, get the 'XObject' dictionary.
    4. From the 'XObject' dictionary, get your XObject using the symbolic name passed to the 'Do' operator.
    5. From the XObject, get the value of the 'Subtype' key. If it is 'Image', ignore the XObject because it is an image. If it is 'Form', you have a form XObject.
    6. Get the stream from the XObject and scan it the same way you scanned the page content stream.
    You can reuse the same scanner class; you just need to keep a context in order to know which object you are scanning. Form XObjects can use other form XObjects, which are located in the parent form XObject's 'Resources' dictionary.
    Your page dictionary looks like this:

    <<
    /ArtBox[0.0 0.0 768.0 7066.0]
    /BleedBox[0.0 0.0 768.0 7066.0]
    /Contents 29 0 R
    /CropBox[0.0 0.0 768.0 7066.0]
    /Group 62 0 R
    /MediaBox[0.0 0.0 768.0 7066.0]
    /Parent 23 0 R
    /Resources
     <<
      /ExtGState<</GS0 30 0 R>>
      /XObject<</Fm0 61 0 R>>
     >>
    /Rotate 0
    /TrimBox[0.0 0.0 768.0 7066.0]
    /Type/Page
    >> 
    

    'Fm0' is the name of the form XObject used in the page content stream; it is the operand of the 'Do' operator. Its resources dictionary looks like this:

    /Resources
     <<
      /ColorSpace<</CS0 32 0 R>>
      /ExtGState<</GS0 34 0 R/GS1 30 0 R>>
      /Font<</T1_0 38 0 R/T1_1 40 0 R>>
      /ProcSet[/PDF/Text]
      /XObject<</Fm0 45 0 R/Fm1 48 0 R/Fm2 51 0 R/Fm3 54 0 R/Fm4 57 0 R/Fm5 60 0 R>>
     >>
    

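    As you can see, it uses several other form XObjects. Note that the nested 'Fm0' (45 0 R) is not the same object as the page-level 'Fm0' (61 0 R), which is why each name has to be resolved against the resources of the stream currently being scanned.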
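
The procedure above could be sketched roughly as follows. This is an untested outline, not a drop-in implementation. `ScanContext`, `op_Do`, `ScanPageIncludingForms`, and `kMaxFormDepth` are hypothetical names introduced for illustration. Instead of walking 'Resources' → 'XObject' by hand, the sketch asks the scanner's content stream to resolve the name with `CGPDFContentStreamGetResource`, so the "context" is simply the content stream currently being scanned.

```objc
#include <CoreGraphics/CoreGraphics.h>
#include <string.h>

// Hypothetical context passed to every callback through the scanner's info pointer.
typedef struct {
    CGPDFOperatorTableRef table;  // operator table shared by page and form scanners
    int depth;                    // current form XObject nesting level
} ScanContext;

// Arbitrary limit (an assumption) to guard against forms that reference themselves.
static const int kMaxFormDepth = 16;

// Callback for the 'Do' operator: scan form XObjects, skip everything else.
static void op_Do(CGPDFScannerRef scanner, void *info) {
    ScanContext *ctx = (ScanContext *)info;

    // The operand of 'Do' is the symbolic name of the XObject, e.g. "Fm0".
    const char *name = NULL;
    if (!CGPDFScannerPopName(scanner, &name)) return;

    // Resolve the name against the resources of the stream being scanned.
    CGPDFContentStreamRef cs = CGPDFScannerGetContentStream(scanner);
    CGPDFObjectRef obj = CGPDFContentStreamGetResource(cs, "XObject", name);
    CGPDFStreamRef stream = NULL;
    if (obj == NULL || !CGPDFObjectGetValue(obj, kCGPDFObjectTypeStream, &stream)) return;

    // Only 'Form' XObjects carry a content stream worth scanning.
    CGPDFDictionaryRef dict = CGPDFStreamGetDictionary(stream);
    const char *subtype = NULL;
    if (!CGPDFDictionaryGetName(dict, "Subtype", &subtype)) return;
    if (strcmp(subtype, "Form") != 0) return;   // e.g. 'Image': ignore
    if (ctx->depth >= kMaxFormDepth) return;

    // A form may have its own 'Resources'; resources stays NULL if it has none.
    CGPDFDictionaryRef resources = NULL;
    CGPDFDictionaryGetDictionary(dict, "Resources", &resources);

    // Scan the form's stream with the same operator table, one level deeper.
    CGPDFContentStreamRef formCS = CGPDFContentStreamCreateWithStream(stream, resources, cs);
    CGPDFScannerRef formScanner = CGPDFScannerCreate(formCS, ctx->table, ctx);
    ctx->depth++;
    CGPDFScannerScan(formScanner);
    ctx->depth--;
    CGPDFScannerRelease(formScanner);
    CGPDFContentStreamRelease(formCS);
}

// Scans a page and, through op_Do, every form XObject it draws.
static bool ScanPageIncludingForms(CGPDFPageRef page) {
    ScanContext ctx = { CGPDFOperatorTableCreate(), 0 };
    CGPDFOperatorTableSetCallback(ctx.table, "Do", op_Do);
    // Register your existing callbacks (for example "Tj", "TJ") on ctx.table here.

    CGPDFContentStreamRef cs = CGPDFContentStreamCreateWithPage(page);
    CGPDFScannerRef scanner = CGPDFScannerCreate(cs, ctx.table, &ctx);
    bool ok = CGPDFScannerScan(scanner);

    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(cs);
    CGPDFOperatorTableRelease(ctx.table);
    return ok;
}
```

Because the nested scanner is created with the same operator table and info pointer, your existing text callbacks would fire for operators inside the form as well. If those callbacks already expect a different info pointer, one option is to add a field for it to `ScanContext`.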