I want to copy the text of an original PDF document into a new PDF document preserving the formatting of the source text.
I have already done some tests, but the result of copying the text into the new document is not what I hoped for. Below is the code that writes to the content stream.
for (PDPage page : newDoc.getPages()) {
    PDPageContentStream contentStream = new PDPageContentStream(newDoc, page);
    contentStream.beginText();
    for(List<TextLine> row : rowList){
        for(TextLine characters : line){
            contentStream.setFont(characters.getFont(), characters.getFontSize());
            contentStream.newLineAtOffset(characters.getxPos(), characters.getyPos());
            contentStream.setLeading(10.5f);
            contentStream.showText(characters.getText());
        }
    }
    contentStream.endText();
    contentStream.close();
}
We already discussed your approach in the comments to your question and you eventually asked for a practical example.
Unfortunately your code is not compilable, let alone runnable, so I had to create somewhat different code:
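import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

void copyText(PDDocument source, int sourcePageNumber, PDDocument target, PDPage targetPage) throws IOException {
    // Collect the TextPosition objects of the source page in text extraction order.
    List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper pdfTextStripper = new PDFTextStripper() {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }
    };
    // The stripper counts pages from 1, sourcePageNumber counts from 0.
    pdfTextStripper.setStartPage(sourcePageNumber + 1);
    pdfTextStripper.setEndPage(sourcePageNumber + 1);
    pdfTextStripper.getText(source);

    // TextPosition coordinates have their origin at the upper left, so y must be flipped.
    PDRectangle targetPageCropBox = targetPage.getCropBox();
    float yOffset = targetPageCropBox.getUpperRightY() + targetPageCropBox.getLowerLeftY();

    try (PDPageContentStream contentStream = new PDPageContentStream(target, targetPage, AppendMode.APPEND, true, true)) {
        contentStream.beginText();
        float x = 0;
        float y = yOffset;
        for (TextPosition position: allTextPositions) {
            contentStream.setFont(position.getFont(), position.getFontSizeInPt());
            // newLineAtOffset is relative to the previous line start, hence the differences.
            contentStream.newLineAtOffset(position.getX() - x, - (position.getY() - y));
            contentStream.showText(position.getUnicode());
            x = position.getX();
            y = position.getY();
        }
        contentStream.endText();
    }
}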
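The only slightly tricky part of the loop above is the argument pair of newLineAtOffset: that method moves relative to the start of the previous line, while TextPosition reports absolute coordinates measured from the upper left. The following is a minimal sketch of just that arithmetic without any PDFBox dependency. The class RelativeOffsets, the method toOffsets, and the use of plain float pairs in place of getX() and getY() are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RelativeOffsets {
    /**
     * Turns absolute positions (x, y measured from the top) into the
     * relative offsets that would be passed to newLineAtOffset.
     * Hypothetical helper, mirroring the arithmetic of the loop above.
     */
    static List<float[]> toOffsets(List<float[]> positions, float yOffset) {
        List<float[]> offsets = new ArrayList<>();
        float x = 0;
        float y = yOffset;
        for (float[] p : positions) {
            // Difference to the previous position; the sign of y is flipped
            // because PDF user space has y growing upwards.
            offsets.add(new float[] { p[0] - x, -(p[1] - y) });
            x = p[0];
            y = p[1];
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<float[]> positions = Arrays.asList(
                new float[] { 100f, 50f },
                new float[] { 110f, 50f },
                new float[] { 100f, 62f });
        for (float[] offset : toOffsets(positions, 800f)) {
            System.out.println(offset[0] + " " + offset[1]);
        }
    }
}
```

Summing the offsets up to any glyph gives x and yOffset - y for that glyph, which is its absolute position in PDF coordinates, provided the crop boxes of source and target match.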
You can apply it to a full document like this:
void copyText(PDDocument source, PDDocument target) throws IOException {
    for (int i = 0; i < source.getNumberOfPages(); i++) {
        PDPage sourcePage = source.getPage(i);
        PDPage targetPage = null;
        if (i < target.getNumberOfPages())
            targetPage = target.getPage(i);
        else
            target.addPage(targetPage = new PDPage(sourcePage.getMediaBox()));
        copyText(source, i, target, targetPage);
    }
}
Applied to some example documents one gets:
As is to be expected, "text" that is actually drawn as a bitmap image is not copied.
Also beware, this is just a proof of concept and not a complete implementation. In particular page rotation and non-upright text in general are not supported. Also the only supported style attributes are text font and text size, other details (e.g. text color) are ignored. Different page geometries in source and target also will result in weird appearances.
In a comment you asked
If I wanted to replace some words in the source document with others in the target document and then format it, how could I modify the code?
Replacing some glyphs while keeping everything else in place is fairly easy. The TextPosition instances in allTextPositions are sorted the same way as the normal text output of the PDFTextStripper. To find certain words, therefore, you can simply search allTextPositions for sequences of TextPosition instances whose texts, concatenated, form the word in question.
To allow for this, I extended the above methods to additionally accept a Consumer that is called between retrieval and drawing:
void copyText(PDDocument source, PDDocument target, Consumer<List<TextPosition>> updater) throws IOException {
    for (int i = 0; i < source.getNumberOfPages(); i++) {
        PDPage sourcePage = source.getPage(i);
        PDPage targetPage = null;
        if (i < target.getNumberOfPages())
            targetPage = target.getPage(i);
        else
            target.addPage(targetPage = new PDPage(sourcePage.getMediaBox()));
        copyText(source, i, target, targetPage, updater);
    }
}

void copyText(PDDocument source, int sourcePageNumber, PDDocument target, PDPage targetPage, Consumer<List<TextPosition>> updater) throws IOException {
    List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper pdfTextStripper = new PDFTextStripper() {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }
    };
    pdfTextStripper.setStartPage(sourcePageNumber + 1);
    pdfTextStripper.setEndPage(sourcePageNumber + 1);
    pdfTextStripper.getText(source);

    if (updater != null)
        updater.accept(allTextPositions);

    PDRectangle targetPageCropBox = targetPage.getCropBox();
    float yOffset = targetPageCropBox.getUpperRightY() + targetPageCropBox.getLowerLeftY();

    try (PDPageContentStream contentStream = new PDPageContentStream(target, targetPage, AppendMode.APPEND, true, true)) {
        contentStream.beginText();
        float x = 0;
        float y = yOffset;
        for (TextPosition position: allTextPositions) {
            contentStream.setFont(position.getFont(), position.getFontSizeInPt());
            contentStream.newLineAtOffset(position.getX() - x, - (position.getY() - y));
            contentStream.showText(position.getUnicode());
            x = position.getX();
            y = position.getY();
        }
        contentStream.endText();
    }
}
(CopyFormattedPageText methods)
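The search step itself (finding a run of TextPosition instances whose texts concatenate to the search word) can be tried out without PDFBox. Here is a minimal sketch under the assumption that each list entry stands for the getUnicode() value of one TextPosition; ChunkSearch and find are hypothetical names. Unlike the single-pass loops in the replacement methods below, this variant restarts the comparison at every chunk, so, as far as I can tell, it also finds a match in a sequence like T, T, e, s, t, which a single pass would skip.

```java
import java.util.Arrays;
import java.util.List;

class ChunkSearch {
    /**
     * Returns {start, endExclusive} of the first run of chunks whose
     * concatenated text equals word, or null if there is no such run.
     * Hypothetical helper; chunks play the role of TextPosition texts.
     */
    static int[] find(List<String> chunks, String word) {
        if (word == null || word.isEmpty())
            return null;
        for (int start = 0; start < chunks.size(); start++) {
            StringBuilder candidate = new StringBuilder();
            for (int end = start; end < chunks.size(); end++) {
                candidate.append(chunks.get(end));
                if (!word.startsWith(candidate.toString()))
                    break; // this start cannot lead to a match
                if (candidate.length() == word.length())
                    return new int[] { start, end + 1 };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("A", " ", "T", "e", "s", "t");
        System.out.println(Arrays.toString(find(chunks, "Test"))); // prints [2, 6]
    }
}
```

Like the methods below, this matches the word anywhere, including inside a longer word; checking for word boundaries would need an extra test on the neighbouring chunks.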
Now there are different strategies for replacing the glyphs. I've implemented two simple ones.
The first strategy replaces a search word in the list of TextPosition objects by replacing the letters in each instance by the same number of letters from the replacement word (as long as available). This is appropriate if the word is specially formatted (e.g. spaced out) and this special formatting shall be kept.
void searchAndReplace(List<TextPosition> textPositions, String searchWord, String replacement) {
    if (searchWord == null || searchWord.length() == 0)
        return;

    int candidatePosition = 0;
    String candidate = "";
    for (int i = 0; i < textPositions.size(); i++) {
        candidate += textPositions.get(i).getUnicode();
        if (!searchWord.startsWith(candidate)) {
            candidate = "";
            candidatePosition = i+1;
        } else if (searchWord.length() == candidate.length()) {
            for (int j = 0; j < searchWord.length();) {
                TextPosition textPosition = textPositions.get(candidatePosition);
                int length = textPosition.getUnicode().length();
                String replacementHere = "";
                if (length > 0 && j < replacement.length()) {
                    int end = j + length;
                    if (end > replacement.length())
                        end = replacement.length();
                    replacementHere = replacement.substring(j, end);
                }
                TextPosition newTextPosition = new TextPosition(textPosition.getRotation(),
                        textPosition.getPageWidth(), textPosition.getPageHeight(), textPosition.getTextMatrix(),
                        textPosition.getEndX(), textPosition.getEndY(), textPosition.getHeight(),
                        textPosition.getIndividualWidths()[0], textPosition.getWidthOfSpace(),
                        replacementHere,
                        textPosition.getCharacterCodes(), textPosition.getFont(),
                        textPosition.getFontSize(), (int) textPosition.getFontSizeInPt());
                textPositions.set(candidatePosition, newTextPosition);
                candidatePosition++;
                j += length;
            }
        }
    }
}
(CopyFormattedPageText method)
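To see what this first strategy does to the individual chunks, here is a minimal sketch on plain strings, assuming the matching range of chunks is already known. SpreadReplacement and spread are hypothetical names, and each string stands for the text of one TextPosition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpreadReplacement {
    /**
     * Replaces the chunks in [start, end) letter by letter with the
     * replacement; the number of chunks stays the same. Assumes a valid
     * range. Hypothetical helper mirroring the first strategy.
     */
    static List<String> spread(List<String> chunks, int start, int end, String replacement) {
        List<String> result = new ArrayList<>(chunks);
        int j = 0; // number of replacement letters handed out so far
        for (int k = start; k < end; k++) {
            int length = result.get(k).length();
            int from = Math.min(j, replacement.length());
            int to = Math.min(j + length, replacement.length());
            // Chunks beyond the end of the replacement become empty.
            result.set(k, replacement.substring(from, to));
            j += length;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("T", "e", "s", "t");
        System.out.println(spread(chunks, 0, 4, "Art")); // prints [A, r, t, ]
    }
}
```

Each chunk keeps its own position, so a shorter replacement leaves a gap at the end of the word, and a longer replacement is cut off after as many letters as the search word has.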
The second strategy replaces a search word in the list of TextPosition objects by replacing the letters in the first instance by the whole replacement word and removing the other instances. This is appropriate if the word is not specially formatted (e.g. spaced out) and shall be printed naturally.
void searchAndReplaceAlternative(List<TextPosition> textPositions, String searchWord, String replacement) {
    if (searchWord == null || searchWord.length() == 0)
        return;

    int candidatePosition = 0;
    String candidate = "";
    for (int i = 0; i < textPositions.size(); i++) {
        candidate += textPositions.get(i).getUnicode();
        if (!searchWord.startsWith(candidate)) {
            candidate = "";
            candidatePosition = i+1;
        } else if (searchWord.length() == candidate.length()) {
            TextPosition textPosition = textPositions.get(candidatePosition);
            TextPosition newTextPosition = new TextPosition(textPosition.getRotation(),
                    textPosition.getPageWidth(), textPosition.getPageHeight(), textPosition.getTextMatrix(),
                    textPosition.getEndX(), textPosition.getEndY(), textPosition.getHeight(),
                    textPosition.getIndividualWidths()[0], textPosition.getWidthOfSpace(),
                    replacement,
                    textPosition.getCharacterCodes(), textPosition.getFont(),
                    textPosition.getFontSize(), (int) textPosition.getFontSizeInPt());
            textPositions.set(candidatePosition, newTextPosition);
            while (i > candidatePosition) {
                textPositions.remove(i--);
            }
            candidatePosition++;
        }
    }
}
(CopyFormattedPageText method)
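The effect of this second strategy on the chunk list can again be sketched on plain strings, assuming the matching range is already known. CollapseReplacement and collapse are hypothetical names, and each string stands for the text of one TextPosition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CollapseReplacement {
    /**
     * Puts the whole replacement into the first chunk of [start, end)
     * and removes the remaining chunks of that range. Assumes a valid,
     * non-empty range. Hypothetical helper mirroring the second strategy.
     */
    static List<String> collapse(List<String> chunks, int start, int end, String replacement) {
        List<String> result = new ArrayList<>(chunks);
        result.set(start, replacement);
        result.subList(start + 1, end).clear();
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("D", "O", "C", " ", "x");
        System.out.println(collapse(chunks, 0, 3, "COSTUME")); // prints [COSTUME,  , x]
    }
}
```

The replacement is drawn as one string starting at the position of the first letter, using the font's natural spacing. The chunks after the match keep their original positions, so a replacement that is wider than the original word may run into the text that follows.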
You use these strategies like this in your copyText calls:
copyText(source, target, list -> searchAndReplace(list, "Test", "Art"));
...
copyText(source, target, list -> searchAndReplaceAlternative(list, "DOCUMENT", "COSTUME"));
(CopyFormattedPageText test methods)
Beware, though: if the fonts used are subset-embedded, the glyphs for the replacement text may not exist in that font. In that case, create and use another font that does include the replacement glyphs. Also, the replacement should be about as long as the word it replaces.
As you did not mention a specific PDFBox version, I used the current 3.0.1.