
Understanding PDF operators - for iOS app


I have been tasked with creating a PDF reader app for our company. After some research, I became confused by the different operators inside a PDF. Here are a few things I would like to clarify:

  • The Tm operator is used as the starting point of each line. (Is my understanding correct?)
  • If the Tm operator is the starting point of every line, how can I parse the text shown only within the specified Tm? e.g.:

     BT
        0 0 1 rg
        /Ti 12 Tf
        1 0 0 1 100 100 Tm
        0 0 Td
        (The quick brown fox ) Tj 0 -13 Td
        (ate the lazy mouse.) Tj
     ET
     //I only want to get the Tj and TJ string being positioned by the Tm
    
  • I understand that every 1000 units of a glyph's height and width are equivalent to 1 unit of text space. So if the glyph's width is 2000 and its height is 1060, does that mean that its "real" width and height are 2 and 1.06 respectively?

Now I know that some of these questions sound outright stupid, but I really don't have much time for research. So if anyone can help me understand this, it would definitely be appreciated.

NOTE: The PDF reader app must include search and highlight, text selection, notes, bookmarks, etc.: practically all the basic features you can find in almost every reader available nowadays. I will probably use a third-party library to make my life easier, but my biggest problem will be the text selection function, so I really need to understand this.


Solution

  • You'll need to familiarize yourself with the PDF specification. Annex A contains a summary of all the operators, with links to more detailed documentation about their parameters, so it may be a good starting point.

    The Tm operator doesn't necessarily set the starting point of each line. It sets the text matrix, which is roughly the equivalent of a CGAffineTransform in Quartz 2D terms. To move to the next line, a document could also use the Td, TD, T*, ' or " operators. PDF documents don't necessarily draw their text in the order in which it appears on screen; they may move around the page freely and position glyphs in any order they see fit. PDF doesn't really have the concept of a "line", so you'll have to infer lines from the positions of the glyphs yourself (which can be tricky for things like subscripts and superscripts).
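
To make the positioning rules concrete, here is a minimal sketch that replays the content stream from the question and records where each shown string starts. It is written in Python only to keep it short; the arithmetic is the same with any affine transform type. The names `concat`, `TextState`, `run` and the pre-tokenized `ops` list are all made up for this illustration and are not part of any PDF library. It assumes the operators have already been parsed, ignores `cm`, and does not advance the position by the width of the shown text (that would need font metrics), so it only reports the origin of each string in the current user space.

```python
from dataclasses import dataclass, field

# PDF matrices are written as six numbers: a b c d e f.
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m1, m2):
    """Return m1 x m2 for matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


@dataclass
class TextState:
    line_matrix: tuple = IDENTITY   # text line matrix
    leading: float = 0.0            # set by TL or TD, used by T*, ' and "
    shown: list = field(default_factory=list)

    def next_line(self, tx, ty):
        # Td semantics: translate relative to the start of the current line.
        self.line_matrix = concat((1.0, 0.0, 0.0, 1.0, tx, ty), self.line_matrix)

    def show(self, text):
        # Record the origin of the string (translation part of the matrix).
        x, y = self.line_matrix[4], self.line_matrix[5]
        self.shown.append((x, y, text))


def run(ops):
    """Replay (operator, operands) pairs and return [(x, y, text), ...]."""
    state = TextState()
    for operator, operands in ops:
        if operator == "BT":
            state.line_matrix = IDENTITY
        elif operator == "Tm":
            state.line_matrix = tuple(float(v) for v in operands)
        elif operator == "TL":
            state.leading = float(operands[0])
        elif operator == "Td":
            state.next_line(operands[0], operands[1])
        elif operator == "TD":
            state.leading = -float(operands[1])
            state.next_line(operands[0], operands[1])
        elif operator == "T*":
            state.next_line(0.0, -state.leading)
        elif operator == "Tj":
            state.show(operands[0])
        elif operator == "TJ":
            # Keep only the strings; the numbers are kerning adjustments.
            state.show("".join(s for s in operands[0] if isinstance(s, str)))
        elif operator == "'":
            state.next_line(0.0, -state.leading)
            state.show(operands[0])
        elif operator == '"':
            state.next_line(0.0, -state.leading)
            state.show(operands[2])
        # Other operators (rg, Tf, ET, ...) do not move the text position.
    return state.shown


# The content stream from the question, already split into operations.
ops = [
    ("BT", []),
    ("rg", [0, 0, 1]),
    ("Tf", ["/Ti", 12]),
    ("Tm", [1, 0, 0, 1, 100, 100]),
    ("Td", [0, 0]),
    ("Tj", ["The quick brown fox "]),
    ("Td", [0, -13]),
    ("Tj", ["ate the lazy mouse."]),
    ("ET", []),
]

for x, y, text in run(ops):
    print(x, y, text)
```

For the sample above, the first string starts at (100, 100) and the second at (100, 87): the Tm sets the origin once, and the following Td moves down 13 units relative to it. Grouping the recorded strings by similar `y` values is one simple way to start inferring lines.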