I need help with text positioning when recreating a PDF. I have extracted all text using a text stripper and can draw it on a new PDF with the correct font and font size, but I am not able to draw the text at its correct position. (Two screenshots were attached: the first shows the input PDF, the second the output PDF.)
// This is how I am extracting the TextPosition data
protected void processTextPosition(TextPosition text) {
textPositionPDGraphicsStatesMap.put(text, getGraphicsState());
PDGraphicsState state = getGraphicsState();
PDTextState textState = state.getTextState();
float fontSize = textState.getFontSize();
float horizontalScaling = textState.getHorizontalScaling() / 100f;
float charSpacing = textState.getCharacterSpacing();
// put the text state parameters into matrix form
Matrix parameters = new Matrix(
fontSize * horizontalScaling, 0, // 0
0, fontSize, // 0
0, textState.getRise()); // 1
// text rendering matrix (text space -> device space)
Matrix ctm = state.getCurrentTransformationMatrix();
Matrix textRenderingMatrix = parameters.multiply(text.getTextMatrix()).multiply(ctm);
TextPositionsInfo txtInfo = new TextPositionsInfo();
txtInfo.xDir = text.getXDirAdj();
txtInfo.yDir = text.getYDirAdj();
txtInfo.x = textRenderingMatrix.getTranslateX();
txtInfo.y = textRenderingMatrix.getTranslateY();
txtInfo.textMatrix = textRenderingMatrix;
txtInfo.height= text.getHeightDir();
txtInfo.width = text.getWidthDirAdj();
txtInfo.unicode = text.getUnicode();
txtInfo.fontName = text.getFont().getFontDescriptor().getFontName();
txtInfo.fontSize = getActualFontSize(text, getGraphicsState());
pdfGraphicContent.textPositions.add(txtInfo);
}
// Here I place each char and add it to the content stream
private void addTextCharByChar(String string, List<TextPositionsInfo> textinfoList, TextBBoxinfo textBBoxinfo,PDPage page) throws IOException {
PDResources res = page.getResources();
currentContentStream.beginText();
if (textBBoxinfo._ElementType.toLowerCase().equals("h2")) {
beginMarkedConent(COSName.P);
for(TextPositionsInfo textInfo : textinfoList) {
PDFont font = getFont(res, textInfo.fontName);
currentContentStream.setFont(font, textInfo.fontSize);
Matrix _tm = textInfo.textMatrix;
currentContentStream.newLineAtOffset(_tm.getTranslateX(), _tm.getTranslateY());
currentContentStream.setTextMatrix(_tm);
currentContentStream.showText(textInfo.unicode);
}
currentContentStream.endMarkedContent();
addContentToCurrentSection(COSName.P, StandardStructureTypes.H2);
}else if (textBBoxinfo._ElementType.toLowerCase().equals("h1")) {
beginMarkedConent(COSName.P);
for(TextPositionsInfo textInfo : textinfoList) {
PDFont font = getFont(res, textInfo.fontName);
currentContentStream.setFont(font, textInfo.fontSize);
currentContentStream.newLineAtOffset(textInfo.textMatrix.getTranslateX(),
textInfo.textMatrix.getTranslateY());
currentContentStream.setTextMatrix(textInfo.textMatrix);
currentContentStream.showText(textInfo.unicode);
}
currentContentStream.endMarkedContent();
addContentToCurrentSection(COSName.P, StandardStructureTypes.H1);
}
currentContentStream.endText();
}
}
There is an error in the code you show. But as I couldn't reproduce the error as it appears in your screenshots, I assume that there is another error somewhere in the code you don't show.
Unfortunately you neither provided self-contained code nor your example PDF. To check it, therefore, I had to change the code a bit to make it runnable. Furthermore, I had to select a test document of my own; I actually found one looking very much like your screenshot.
In processTextPosition you try to calculate the text rendering matrix like this:
PDGraphicsState state = getGraphicsState();
PDTextState textState = state.getTextState();
float fontSize = textState.getFontSize();
float horizontalScaling = textState.getHorizontalScaling() / 100f;
float charSpacing = textState.getCharacterSpacing();
// put the text state parameters into matrix form
Matrix parameters = new Matrix(
fontSize * horizontalScaling, 0, // 0
0, fontSize, // 0
0, textState.getRise()); // 1
// text rendering matrix (text space -> device space)
Matrix ctm = state.getCurrentTransformationMatrix();
Matrix textRenderingMatrix = parameters.multiply(text.getTextMatrix()).multiply(ctm);
This looks like the right way to calculate the text rendering matrix from the available data, except if you read the documentation of the TextPosition.getTextMatrix method:
/**
* The matrix containing the starting text position and scaling. Despite the name, it is not the
* text matrix set by the "Tm" operator, it is really the effective text rendering matrix (which
* is dependent on the current transformation matrix (set by the "cm" operator), the text matrix
* (set by the "Tm" operator), the font size (set by the "Tf" operator) and the page cropbox).
*
* @return The Matrix containing the starting text position
*/
public Matrix getTextMatrix()
Thus, text.getTextMatrix() already is the matrix you want to calculate here. So you can either replace the whole block above by
Matrix textRenderingMatrix = text.getTextMatrix();
or (if you really want to calculate the text rendering matrix yourself) use the stream engine's own text matrix, i.e. the getTextMatrix() method your stripper inherits from PDFStreamEngine, instead of text.getTextMatrix() in that block.
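To see why the extra multiplication matters, here is a small self-contained sketch that does not use PDFBox at all. It models PDF matrices as six-element arrays `[a b c d e f]` and uses made-up values (font size 12, a text matrix translating to (100, 700), a current transformation matrix translating by (10, 20)); the class and method names are purely illustrative.

```java
import java.util.Arrays;

/** Illustrative only: PDF-style matrices as {a, b, c, d, e, f}; not PDFBox code. */
class DoubleTransformDemo {

    /** Row-vector product m1 x m2, i.e. m1 is applied first, then m2. */
    static float[] multiply(float[] m1, float[] m2) {
        return new float[] {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
        };
    }

    // Assumed example values, not taken from the question's PDF.
    static final float[] PARAMETERS = {12, 0, 0, 12, 0, 0};     // font size 12, no rise
    static final float[] TEXT_MATRIX = {1, 0, 0, 1, 100, 700};  // as set by "Tm"
    static final float[] CTM = {1, 0, 0, 1, 10, 20};            // as set by "cm"

    /** parameters x Tm x CTM: the actual text rendering matrix. */
    static float[] correct() {
        return multiply(multiply(PARAMETERS, TEXT_MATRIX), CTM);
    }

    /** Same formula, but fed the already-rendered matrix in place of Tm. */
    static float[] doubled() {
        return multiply(multiply(PARAMETERS, correct()), CTM);
    }

    public static void main(String[] args) {
        System.out.println("correct: " + Arrays.toString(correct()));
        System.out.println("doubled: " + Arrays.toString(doubled()));
    }
}
```

With these values the correct matrix has a scale of 12 and an origin of (110, 720), while the doubled one has a scale of 144 and an origin of (120, 740): font size and current transformation matrix are each applied twice.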
I changed your processTextPosition override to:
protected void processTextPosition(TextPosition text) {
// textPositionPDGraphicsStatesMap.put(text, getGraphicsState());
PDGraphicsState state = getGraphicsState();
PDTextState textState = state.getTextState();
float fontSize = textState.getFontSize();
float horizontalScaling = textState.getHorizontalScaling() / 100f;
float charSpacing = textState.getCharacterSpacing();
// put the text state parameters into matrix form
Matrix parameters = new Matrix(
fontSize * horizontalScaling, 0, // 0
0, fontSize, // 0
0, textState.getRise()); // 1
// text rendering matrix (text space -> device space)
Matrix ctm = state.getCurrentTransformationMatrix();
Matrix textRenderingMatrix = parameters.multiply(/*text.*/getTextMatrix()).multiply(ctm);
TextPositionsInfo txtInfo = new TextPositionsInfo();
txtInfo.xDir = text.getXDirAdj();
txtInfo.yDir = text.getYDirAdj();
txtInfo.x = textRenderingMatrix.getTranslateX();
txtInfo.y = textRenderingMatrix.getTranslateY();
txtInfo.textMatrix = textRenderingMatrix;
txtInfo.height= text.getHeightDir();
txtInfo.width = text.getWidthDirAdj();
txtInfo.unicode = text.getUnicode();
txtInfo.fontName = text.getFont().getFontDescriptor().getFontName();
txtInfo.fontSize = getActualFontSize(text, getGraphicsState());
/*pdfGraphicContent.*/textPositions.add(txtInfo);
// font provisioning not provided by OP. Simple stub.
targetPage.getResources().put(COSName.getPDFName(txtInfo.fontName), text.getFont());
}
// not provided by OP. Simple stub
private float getActualFontSize(TextPosition text, PDGraphicsState graphicsState) {
return text.getFontSize();
}
(CopyFormattedPageText method copyTextLikeNitishKumar)
The main changes here are that I fixed the error (see above) and added the font to the target page resources. As you do not show how you fill the target page resources and match fonts with names, this was the easiest way to improvise. Beware, this is not a good approach in general, as it might lose quite some font information.
I then changed your addTextCharByChar to:
private void addTextCharByChar(/*String string,*/ List<TextPositionsInfo> textinfoList, /*TextBBoxinfo textBBoxinfo,*/ PDPage page, PDPageContentStream currentContentStream) throws IOException {
PDResources res = page.getResources();
currentContentStream.beginText();
// if (textBBoxinfo._ElementType.toLowerCase().equals("h2")) {
currentContentStream.beginMarkedContent(COSName.P);
for(TextPositionsInfo textInfo : textinfoList) {
PDFont font = getFont(res, textInfo.fontName);
currentContentStream.setFont(font, /*textInfo.fontSize*/1);
Matrix _tm = textInfo.textMatrix;
currentContentStream.newLineAtOffset(_tm.getTranslateX(), _tm.getTranslateY());
currentContentStream.setTextMatrix(_tm);
currentContentStream.showText(textInfo.unicode);
}
currentContentStream.endMarkedContent();
// addContentToCurrentSection(COSName.P, StandardStructureTypes.H2);
// } else if (textBBoxinfo._ElementType.toLowerCase().equals("h1")) {
// beginMarkedConent(COSName.P);
// for(TextPositionsInfo textInfo : textinfoList) {
// PDFont font = getFont(res, textInfo.fontName);
// currentContentStream.setFont(font, textInfo.fontSize);
// currentContentStream.newLineAtOffset(textInfo.textMatrix.getTranslateX(), textInfo.textMatrix.getTranslateY());
// currentContentStream.setTextMatrix(textInfo.textMatrix);
// currentContentStream.showText(textInfo.unicode);
// }
// currentContentStream.endMarkedContent();
// addContentToCurrentSection(COSName.P, StandardStructureTypes.H1);
// }
currentContentStream.endText();
}
// not provided by OP. Simple stub
private PDFont getFont(PDResources res, String fontName) throws IOException {
return res.getFont(COSName.getPDFName(fontName));
}
The main change here is dropping code paths that depend on extra information you didn't share, like textBBoxinfo. Furthermore, I use 1 as font size in the currentContentStream.setFont call because you thereafter set the text matrix to the original text rendering matrix, which already contains a scaling by the text font size.
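The arithmetic behind the font size of 1 can be sketched without PDFBox: the size at which a glyph is finally drawn is the size passed to setFont multiplied by the scaling contained in the text matrix. The helper names and the numbers below are assumptions for illustration only.

```java
/** Illustrative only: how the font size operand and the text matrix scale combine. */
class EffectiveFontSizeDemo {

    /** Horizontal scaling factor of a PDF matrix {a, b, c, d, e, f}; also works if rotated. */
    static double scalingFactorX(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    /** Size at which glyphs end up being drawn, ignoring any CTM on the target page. */
    static double effectiveSize(double fontSizeOperand, double[] textMatrix) {
        return fontSizeOperand * scalingFactorX(textMatrix);
    }

    public static void main(String[] args) {
        // Assumed stored text rendering matrix of a 12 pt glyph at (110, 720).
        double[] stored = {12, 0, 0, 12, 110, 720};

        System.out.println(effectiveSize(12, stored)); // font size applied twice
        System.out.println(effectiveSize(1, stored));  // font size taken from the matrix only
    }
}
```

Under these assumptions, replaying the stored matrix with the original font size of 12 draws at 144 units, while a font size of 1 reproduces the original 12.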
Running the code for this file (which looks very much like your screenshot) results in:
So it looks like the code you shared works after fixing the text rendering matrix and font size.
The output in your screenshot does not match the identified error. That error would usually print letters that are much too large, rather than draw a few letters at the correct size and drop the others.
Thus, there appear to be other errors in the code you don't show. I can only guess their causes. My guesses would be:
- textBBoxinfo._ElementType is only "H1" or "H2" for a few letters, the ones you see in your output. As there is no code path for drawing text with other element type values, most of the letters aren't drawn at all.
- getFont only returns a useful font for the letters you see in your output.
- getActualFontSize returns a sensible size only for a few letters, the ones you see in the output. The other letters are too small or not drawn at all in the output.
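The first guess is easy to check by counting glyphs per element type before drawing. The following is only a sketch: it works on plain strings standing in for the _ElementType values, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: tallies element types to reveal values no branch draws. */
class ElementTypeTally {

    /** Counts occurrences of each element type, compared case-insensitively. */
    static Map<String, Integer> tally(List<String> elementTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String type : elementTypes) {
            counts.merge(type.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of entries that neither the "h1" nor the "h2" branch would draw. */
    static int undrawn(Map<String, Integer> counts) {
        int total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!e.getKey().equals("h1") && !e.getKey().equals("h2")) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample; in the real code, collect one value per text chunk.
        Map<String, Integer> counts = tally(List.of("H1", "P", "P", "H2", "P", "Span"));
        System.out.println(counts + " -> undrawn: " + undrawn(counts));
    }
}
```

If the undrawn count is large for your document, the missing letters are explained by the absent code path rather than by the matrix calculation.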