Search code examples
iosobjective-cpdfquartz-2d

Parsing PDF documents with Quartz


I am trying to parse a PDF document with the Quartz framework and have copy & pasted the code snippets from the Apple documentation into my source code. Unfortunately, it does not retrieve any data. It just iterates over the pages, logs the number of the current page to the console and crashes at the end. Do you have any idea on what is wrong with the code?

static void op_MP (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    printf("MP /%s\n", name);
}

static void op_DP (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

     NSLog(@"DP /%s\n", name);
}

static void op_BMC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

    NSLog(@"BMC /%s\n", name);
}

static void op_BDC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;
     NSLog(@"BDC /%s\n", name);
}

static void op_EMC (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

     NSLog(@"EMC /%s\n", name);
}

static void op_TJ (CGPDFScannerRef s, void *info)
{
    const char *name;

    if (!CGPDFScannerPopName(s, &name))
        return;

     NSLog(@"TJ /%s\n", name);
}

- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions
{
    CGPDFDocumentRef myDocument;
    NSString *urlAddress = [[NSBundle mainBundle] pathForResource:@"test" ofType:@"pdf"];
    NSURL *fileUrl = [NSURL fileURLWithPath:urlAddress];
    CFURLRef url = (__bridge CFURLRef)fileUrl;
    myDocument = CGPDFDocumentCreateWithURL(url);

    CFRelease (url);

    if (myDocument == NULL) {// 2
        NSLog(@"can't open `%@'.", fileUrl);
     }
    if (!CGPDFDocumentIsUnlocked (myDocument)) {// 4
         CGPDFDocumentRelease(myDocument);
    }
    else if (CGPDFDocumentGetNumberOfPages(myDocument) == 0) {// 5
        CGPDFDocumentRelease(myDocument);
    }
    else {
        CGPDFOperatorTableRef myTable;
        myTable = CGPDFOperatorTableCreate();

        CGPDFOperatorTableSetCallback (myTable, "MP", &op_MP);
        CGPDFOperatorTableSetCallback (myTable, "DP", &op_DP);
        CGPDFOperatorTableSetCallback (myTable, "BMC", &op_BMC);
        CGPDFOperatorTableSetCallback (myTable, "BDC", &op_BDC);
        CGPDFOperatorTableSetCallback (myTable, "EMC", &op_EMC);
        CGPDFOperatorTableSetCallback (myTable, "Tj", &op_TJ);

        int k;
        CGPDFPageRef myPage;
        CGPDFScannerRef myScanner;
        CGPDFContentStreamRef myContentStream;

        int numOfPages = CGPDFDocumentGetNumberOfPages (myDocument);// 1
        for (k = 0; k < numOfPages; k++) {
            myPage = CGPDFDocumentGetPage (myDocument, k + 1 );// 2
            myContentStream = CGPDFContentStreamCreateWithPage (myPage);// 3
            myScanner = CGPDFScannerCreate (myContentStream, myTable, NULL);// 4
            CGPDFScannerScan (myScanner);// 5
            CGPDFPageRelease (myPage);// 6
            CGPDFScannerRelease (myScanner);// 7
            CGPDFContentStreamRelease (myContentStream);// 8
            NSLog(@"processed page %i",k);
        }
        CGPDFOperatorTableRelease(myTable);
        CGPDFDocumentRelease(myDocument);
    }

    return YES;
}

Solution

  • I did not run the code but the first 5 operators might not exist in your page content. Also some of them have a name operand, some of them do not have any operands (such as EMC). Also the Tj operator has a string operand, not a name.
    Remove all the pop name methods and leave only the logging and you might get some output. Then look in the PDF specification to see the exact operands for each operator and update your code accordingly.