
Inline text editing within a PDF file


I was wondering if there is a programming library available that allows for the inline editing of text within a PDF document. Drawing text onto the document isn't what I'm after this time, and I am already aware of a number of facilities and libraries that allow this to be done. I am looking for something that will allow me to make a change like this (where NEW isn't drawn in but edited into, for instance, a string):

"This is my document" becomes "This is my NEW document".

... The formatting should be preserved (especially where editing isn't being done within a specific area on the page). Word wrapping support would be great too!

So is there anything like this out there or am I barking up the wrong tree? I've looked at a range of facilities such as FPDF, PdfBox, and even GNOME without much luck (tbh, I am sure GNOME may allow it, but getting my head around it is too time consuming at the moment, so pointers on this would also be great).

Thanks, and sorry if this has already been asked.

In terms of programming languages: I am willing to use whatever is suggested in C, C++, Java, PHP, Python, or Perl.


Solution

  • To follow up on my comments, this is what fairly typical raw PDF text output looks like -- a decompressed (inflated) part of page 1213 of the PDF Reference Guide 16-v4:

    36451 0 obj  % Contents
    % used filter: FlateDecode
    /GS2 gs
    BT
    /F1 1 Tf
    8 0 0 8 297.417 105.667 Tm
    0 0 0 1 k
    0 Tc
    0 Tw
    (1213) Tj
    /F5 1 Tf
    24 0 0 24 253.784 617 Tm
    [ (C) 19.1 (olophon) ] TJ
    /F3 1 Tf
    10.505 0 0 10.505 136.5 566 Tm
    -0.0014 Tc
    0.2018 Tw
    [ (This do) -10.1 (c) -7.2 (u) -0.3 (men) 17.6 (t) -1.4 ( was p) 10 (r) 11.9 (o) -10.1 (d) 10.8 (uce) -7.2 (d) -1.3 ( usin) 6.6 (g ) 36.5 (A) 24.6 (d) 0.9 (o) 3.8 (b) -10.1 (e) ] TJ
    8.4 0 0 8.4 326.25 570.2 Tm
    0 Tc
    

    .. several hundred more lines like these omitted. Some points of interest: Tf sets the text font and its size (the font is defined elsewhere, and may have a custom encoding -- not always ASCII). Tj 'shows' text; Tm sets the text transformation matrix in 'current units'. It's impossible to see immediately whether the text 'Colophon' follows right after the '1213' without knowing the actual size of both. Tc and Tw set the default character and word spacing, and are often abused to insert 'spaces'. Not here, though; the TJ array specifies text fragments with interspersed kerning values (I guess, based on their location).

    It's not possible to determine if this single text line is a line on its own, or part of a longer paragraph. It's not even possible to determine if it's a justified string or not -- you would need to compare its left and right edges to other lines to find out.

    (This output was created with a PDF reader I wrote myself from scratch, using the aforementioned reference and not much more.)

    As you can see, merely finding text is a challenge, although there are libraries which are more or less successful at that. None of them -- if I'm correct -- claims to be able to edit "any PDF".
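To make the "merely finding text is a challenge" point concrete, here is a minimal Python sketch that splits the operand of a TJ operator into its text fragments and numeric adjustments. The function names (`parse_tj_array`, `visible_text`) are made up for this illustration and do not come from any PDF library. It also assumes the simplest case, as in the sample above: the font's encoding happens to be readable as ASCII, and the strings contain no nested unescaped parentheses or escape sequences that need decoding.

```python
import re

# Matches either a PDF literal string "(...)" or a number.
# Assumption: no nested unescaped parentheses inside the string.
_TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>[-+]?\d*\.?\d+)")

# The TJ line taken from the content stream shown above.
SAMPLE = (
    "[ (This do) -10.1 (c) -7.2 (u) -0.3 (men) 17.6 (t) -1.4 ( was p) 10 (r) "
    "11.9 (o) -10.1 (d) 10.8 (uce) -7.2 (d) -1.3 ( usin) 6.6 (g ) 36.5 (A) "
    "24.6 (d) 0.9 (o) 3.8 (b) -10.1 (e) ] TJ"
)


def parse_tj_array(operand):
    """Return the TJ array as a list of str (text) and float (adjustment)."""
    items = []
    for match in _TOKEN.finditer(operand):
        if match.group("text") is not None:
            items.append(match.group("text"))
        else:
            items.append(float(match.group("num")))
    return items


def visible_text(items):
    """Join only the text fragments, ignoring the numeric adjustments."""
    return "".join(item for item in items if isinstance(item, str))


if __name__ == "__main__":
    items = parse_tj_array(SAMPLE)
    fragments = [item for item in items if isinstance(item, str)]
    print(fragments)
    print(visible_text(items))
    # The word "document" is spread over several fragments, so a naive
    # search through the raw strings does not find it.
    print(any("document" in fragment for fragment in fragments))
```

Even in this friendly case the word "document" never appears as a single string in the stream; it only shows up once the fragments are joined. Inserting "NEW " would mean choosing a fragment to change and then repositioning everything drawn after it, which is what makes true inline editing so awkward.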