How do Text Objects in PDF work?


I have a PDF document from which I would like to remove watermarks as automatically as possible, in order to get better results from pdftotext.

After uncompressing it with pdftk I see the watermark almost in plain text:

BT
1 0 0 1 277.40012 755.2005 Tm
0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
[()]TJ
0 0 Td
[(Abc)30(defghi K)30(lm)-40(no)]TJ
-5.423981 -9.600038 Td
[()]TJ
0 0 Td
[(Apr 01, 2017 12:34)]TJ
ET

The watermark is

Abcdefghi Klmno
Apr 01, 2017 12:34

After skimming through Document management — Portable document format (especially page 248f), I found the following:

BT: Begin Text
Tm: Text matrix - what is that?
x y Td: Move to the start of the next line with an offset of (x, y)
TJ: Text showing
Tf: Text state
ET: End Text

What I don't understand is all the numbers and why

[(Abc)30(defghi K)30(lm)-40(no)]TJ

Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)? By what unit?

What does Tf do?

Could somebody please explain that?


Solution

  • What I don't understand is all the numbers and why

    [(Abc)30(defghi K)30(lm)-40(no)]TJ
    

    Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)?

    Nearly so, but the other way around: a positive value decreases the space and a negative value increases it, cf. Table 109 – Text-showing operators in the PDF specification:

    array TJ : Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.

    The figure is misleading; apparently some typesetting program scrambled the effect the author wanted to show. The actual source of the figure looks like this:

    BT
    /T1_2 1 Tf
    0 Tc 8.7503 0 0 8.7503 118.989 450.2115 Tm
    [([ \()11(A)53(W)57(A)79(Y again\) ] )41(T)43(J)]TJ
    40.0016 0 0 40.0015 296.9949 440.2111 Tm
    [(A)53(W)57(A)79(Y again)]TJ
    8.7503 0 0 8.7503 118.989 403.2097 Tm
    [([ \()11(A)9(\) 120 \()-50(W)-55(\) 120 \()11(A)9(\) 95 \()-41(Y again\) ] )41(T)43(J)]TJ
    40.0016 0 0 40.0015 296.9949 392.2093 Tm
    (AWAY again)Tj
    ET 
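
To illustrate how a reader of the content stream can separate the shown strings from the positioning numbers in a TJ array, here is a minimal Python sketch. The function names parse_tj_array and shown_text are made up for this example. It only handles literal strings with simple backslash escapes such as \( and \); hex strings, nested unescaped parentheses, octal escapes and font-specific encodings are deliberately left out, so treat it as an illustration rather than a full parser.

```python
import re

# Matches either a literal string "(...)" (allowing backslash escapes)
# or a number inside a TJ array. Hex strings "<...>" and nested
# unescaped parentheses are not handled in this sketch.
TJ_ELEMENT = re.compile(r"\((?:\\.|[^\\()])*\)|[-+]?(?:\d+\.?\d*|\.\d+)")


def parse_tj_array(source):
    """Split the body of a TJ array into strings and numeric adjustments."""
    elements = []
    for match in TJ_ELEMENT.finditer(source):
        token = match.group(0)
        if token.startswith("("):
            # Drop the delimiters and undo simple escapes such as \( and \).
            # Simplification: escapes like \n or octal codes are not decoded.
            elements.append(re.sub(r"\\(.)", r"\1", token[1:-1]))
        else:
            elements.append(float(token))
    return elements


def shown_text(elements):
    """Concatenate only the string elements, ignoring the numbers."""
    return "".join(e for e in elements if isinstance(e, str))


watermark = parse_tj_array("[(Abc)30(defghi K)30(lm)-40(no)]TJ")
print(watermark)
print(shown_text(watermark))
```

Applied to the watermark line from the question, the string elements join to Abcdefghi Klmno; the numbers in between only shift the position of the following glyphs.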
    

    By what unit?

    In thousandths of a unit of text space, cf. the quote above.

    Text space is the coordinate system in which text is shown. It shall be defined by the text matrix, Tm, and the text state parameters Tfs, Th, and Trise, which together shall determine the transformation from text space to user space.

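    A thousandth of a unit of text space often coincides with a single unit in glyph space, because for most font types (all except Type 3) 1000 glyph space units correspond to one unit of text space.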
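
As a worked example of the unit, the following Python sketch applies the adjustment part of the displacement formula from section 9.4.4 of the specification, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, to the numbers from the question. The function name tj_shift is made up here, and the sketch assumes horizontal writing mode and the default horizontal scaling Th = 1.

```python
def tj_shift(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal shift in text space caused by one number in a TJ array.

    Only the -Tj/1000 * Tfs * Th part of the displacement is computed;
    glyph widths, character spacing and word spacing are ignored.
    Positive results move the next glyph to the right, negative ones
    to the left (horizontal writing mode assumed).
    """
    return -(adjustment / 1000.0) * font_size * horizontal_scaling


# Font size 8 comes from "/R2 8 Tf" in the watermark text object.
for number in (30, -40):
    print(number, round(tj_shift(number, font_size=8), 4))
```

With the font size 8 of the watermark, the value 30 pulls the following glyphs about 0.24 units to the left and -40 pushes them about 0.32 units to the right. Since the text matrix 1 0 0 1 277.40012 755.2005 only translates, these are also user space units, provided the current transformation matrix (not shown in the question) does not scale.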


    What does Tf do?

    According to Table 105 – Text state operators in the PDF specification:

    font size Tf : Set the text font, Tf, to font and the text font size, Tfs, to size. font shall be the name of a font resource in the Font subdictionary of the current resource dictionary; size shall be a number representing a scale factor. There is no initial value for either font or size; they shall be specified explicitly by using Tf before any text is shown.


    The only thing I don't understand now is the line

    0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
    

    Can you explain that, too?

    The instruction

    0.501961 0.501961 0.501961 rg
    

    sets the fill color to a medium gray in an RGB color space.

    Then

    /R1 gs
    

    sets additional graphics state parameters from the ExtGState resource named R1; most likely some transparency effect is defined there.

    Finally

    /R2 8 Tf
    

    sets the font to one defined by the Font resource named R2 and the font size to 8.
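
The line reads as three instructions because content streams use postfix notation: the operands come first and the operator comes last. The following Python sketch groups the tokens accordingly. The names group_operators and is_number are hypothetical, and the plain split() only works because this particular line contains no strings or arrays.

```python
def is_number(token):
    """Return True if the token parses as a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def group_operators(line):
    """Group whitespace-separated tokens into (operator, operands) pairs.

    Assumes the line holds only numbers, names ("/R1") and operators,
    so every token that is neither a number nor a name ends an instruction.
    """
    instructions, operands = [], []
    for token in line.split():
        if token.startswith("/") or is_number(token):
            operands.append(token)
        else:
            instructions.append((token, operands))
            operands = []
    return instructions


line = "0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf"
for operator, operands in group_operators(line):
    print(operator, operands)

# Each colour component corresponds to this 8-bit value.
print(round(0.501961 * 255))
```

The last line shows where the odd-looking colour values come from: each component is 128/255, i.e. the medium gray (128, 128, 128) in 8-bit RGB terms.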