
How to fix text positioning when recreating a PDF


[Screenshots: the first shows the input PDF, the second the output PDF]

I need help with text positioning when recreating a PDF. I have extracted all the text using a text stripper and am able to draw it on a new PDF with the correct font and font size, but I am not able to draw the text at its correct position.

// This is how I extract the TextPosition data

protected void processTextPosition(TextPosition text) {
         textPositionPDGraphicsStatesMap.put(text, getGraphicsState());
         PDGraphicsState state = getGraphicsState();
         PDTextState textState = state.getTextState();
         float fontSize = textState.getFontSize();
         float horizontalScaling = textState.getHorizontalScaling() / 100f;
         float charSpacing = textState.getCharacterSpacing();

         // put the text state parameters into matrix form
         Matrix parameters = new Matrix(
                    fontSize * horizontalScaling, 0, // 0
                    0, fontSize,                     // 0
                    0, textState.getRise());         // 1
        
        // text rendering matrix (text space -> device space)
        Matrix ctm = state.getCurrentTransformationMatrix();
        Matrix textRenderingMatrix = parameters.multiply(text.getTextMatrix()).multiply(ctm);
        
        TextPositionsInfo txtInfo = new TextPositionsInfo();
        txtInfo.xDir = text.getXDirAdj();
        txtInfo.yDir = text.getYDirAdj();
        txtInfo.x =  textRenderingMatrix.getTranslateX();
        txtInfo.y = textRenderingMatrix.getTranslateY();
        txtInfo.textMatrix = textRenderingMatrix;
        txtInfo.height= text.getHeightDir();
        txtInfo.width = text.getWidthDirAdj(); 
        txtInfo.unicode = text.getUnicode();
        txtInfo.fontName = text.getFont().getFontDescriptor().getFontName();
        txtInfo.fontSize = getActualFontSize(text, getGraphicsState());
        pdfGraphicContent.textPositions.add(txtInfo);
        
    }

// Here I place each character and write it to the content stream

private void addTextCharByChar(String string, List<TextPositionsInfo> textinfoList, TextBBoxinfo textBBoxinfo,PDPage page) throws IOException {

         PDResources res = page.getResources();
        currentContentStream.beginText(); 
        if (textBBoxinfo._ElementType.toLowerCase().equals("h2")) {
            beginMarkedConent(COSName.P);
            for(TextPositionsInfo textInfo : textinfoList) {
                PDFont font = getFont(res, textInfo.fontName);
                currentContentStream.setFont(font, textInfo.fontSize);
                Matrix _tm = textInfo.textMatrix;
                currentContentStream.newLineAtOffset(_tm.getTranslateX(), _tm.getTranslateY());
                currentContentStream.setTextMatrix(_tm);
                currentContentStream.showText(textInfo.unicode);
            }
            currentContentStream.endMarkedContent();
            addContentToCurrentSection(COSName.P, StandardStructureTypes.H2);
            
        }else if (textBBoxinfo._ElementType.toLowerCase().equals("h1")) {
            beginMarkedConent(COSName.P);
            for(TextPositionsInfo textInfo : textinfoList) {
                PDFont font = getFont(res, textInfo.fontName);
                currentContentStream.setFont(font, textInfo.fontSize);
                currentContentStream.newLineAtOffset(textInfo.textMatrix.getTranslateX(), textInfo.textMatrix.getTranslateY());
                currentContentStream.setTextMatrix(textInfo.textMatrix);
                currentContentStream.showText(textInfo.unicode);
            }
            currentContentStream.endMarkedContent();
            addContentToCurrentSection(COSName.P, StandardStructureTypes.H1);
            
        }
        currentContentStream.endText();
    }


Solution

  • There is an error in the code you show. But as I couldn't reproduce the problem the way it appears in your screenshots, I assume that there is another error somewhere in the code you don't show.

    In detail

    Unfortunately, you provided neither self-contained code nor your example PDF. To test it, I therefore had to change the code a bit to make it runnable. I also had to pick a test document of my own; I found one that looks very much like your screenshot.

    The error

    In processTextPosition you try to calculate the text rendering matrix like this:

     PDGraphicsState state = getGraphicsState();
     PDTextState textState = state.getTextState();
     float fontSize = textState.getFontSize();
     float horizontalScaling = textState.getHorizontalScaling() / 100f;
     float charSpacing = textState.getCharacterSpacing();
    
     // put the text state parameters into matrix form
     Matrix parameters = new Matrix(
                fontSize * horizontalScaling, 0, // 0
                0, fontSize,                     // 0
                0, textState.getRise());         // 1
    
    // text rendering matrix (text space -> device space)
    Matrix ctm = state.getCurrentTransformationMatrix();
    Matrix textRenderingMatrix = parameters.multiply(text.getTextMatrix()).multiply(ctm);
    

    This looks like the right way to calculate the text rendering matrix from the available data, until you read the documentation of the TextPosition.getTextMatrix method:

    /**
     * The matrix containing the starting text position and scaling. Despite the name, it is not the
     * text matrix set by the "Tm" operator, it is really the effective text rendering matrix (which
     * is dependent on the current transformation matrix (set by the "cm" operator), the text matrix
     * (set by the "Tm" operator), the font size (set by the "Tf" operator) and the page cropbox).
     *
     * @return The Matrix containing the starting text position
     */
    public Matrix getTextMatrix()
    

    Thus, text.getTextMatrix() is already the matrix you want to calculate here. So you can either replace the whole block above by

    Matrix textRenderingMatrix = text.getTextMatrix();
    

    or (if you really want to calculate the text rendering matrix yourself) use state.getTextMatrix() instead of text.getTextMatrix() in that block.
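
To see why this mistake matters, here is a minimal, library-free sketch of the same arithmetic. The Affine class is a hypothetical stand-in for PDFBox's Matrix (same row-vector convention), and the numbers (font size 12, text placed at 72/700, identity CTM) are made up for illustration. Feeding a finished text rendering matrix back into the formula multiplies the font size in a second time:

```java
/** Hypothetical stand-in for a PDF matrix [a b 0; c d 0; e f 1] (row-vector convention). */
final class Affine {
    final double a, b, c, d, e, f;

    Affine(double a, double b, double c, double d, double e, double f) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.e = e; this.f = f;
    }

    /** Returns this x other, i.e. "apply this first, then other". */
    Affine multiply(Affine o) {
        return new Affine(
                a * o.a + b * o.c,       a * o.b + b * o.d,
                c * o.a + d * o.c,       c * o.b + d * o.d,
                e * o.a + f * o.c + o.e, e * o.b + f * o.d + o.f);
    }
}

final class DoubleScalingDemo {
    /** parameters x textMatrix x ctm, as in the block quoted above. */
    static Affine renderingMatrix(double fontSize, double hScale, double rise,
                                  Affine textMatrix, Affine ctm) {
        Affine parameters = new Affine(fontSize * hScale, 0, 0, fontSize, 0, rise);
        return parameters.multiply(textMatrix).multiply(ctm);
    }

    public static void main(String[] args) {
        Affine tm = new Affine(1, 0, 0, 1, 72, 700);   // assumed "Tm" value
        Affine ctm = new Affine(1, 0, 0, 1, 0, 0);     // assumed identity CTM

        // Correct: start from the real text matrix.
        Affine correct = renderingMatrix(12, 1, 0, tm, ctm);
        // Mistake: start from a matrix that already is the rendering matrix.
        Affine doubled = renderingMatrix(12, 1, 0, correct, ctm);

        System.out.println(correct.d);  // 12.0
        System.out.println(doubled.d);  // 144.0
        System.out.println(doubled.e + " " + doubled.f);  // 72.0 700.0
    }
}
```

In this sketch the position stays the same, but the scale goes from 12 to 144, which is why this kind of error normally shows up as oversized glyphs.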

    The changed code

    I changed your processTextPosition override to:

    protected void processTextPosition(TextPosition text) {
        // textPositionPDGraphicsStatesMap.put(text, getGraphicsState());
        PDGraphicsState state = getGraphicsState();
        PDTextState textState = state.getTextState();
        float fontSize = textState.getFontSize();
        float horizontalScaling = textState.getHorizontalScaling() / 100f;
        float charSpacing = textState.getCharacterSpacing();

        // put the text state parameters into matrix form
        Matrix parameters = new Matrix(
                fontSize * horizontalScaling, 0, // 0
                0, fontSize,                     // 0
                0, textState.getRise());         // 1

        // text rendering matrix (text space -> device space)
        Matrix ctm = state.getCurrentTransformationMatrix();
        Matrix textRenderingMatrix = parameters.multiply(/*text*/state.getTextMatrix()).multiply(ctm);

        TextPositionsInfo txtInfo = new TextPositionsInfo();
        txtInfo.xDir = text.getXDirAdj();
        txtInfo.yDir = text.getYDirAdj();
        txtInfo.x = textRenderingMatrix.getTranslateX();
        txtInfo.y = textRenderingMatrix.getTranslateY();
        txtInfo.textMatrix = textRenderingMatrix;
        txtInfo.height = text.getHeightDir();
        txtInfo.width = text.getWidthDirAdj();
        txtInfo.unicode = text.getUnicode();
        txtInfo.fontName = text.getFont().getFontDescriptor().getFontName();
        txtInfo.fontSize = getActualFontSize(text, getGraphicsState());
        /*pdfGraphicContent.*/textPositions.add(txtInfo);

        // font provisioning not provided by OP. Simple stub.
        targetPage.getResources().put(COSName.getPDFName(txtInfo.fontName), text.getFont());
    }

    // not provided by OP. Simple stub
    private float getActualFontSize(TextPosition text, PDGraphicsState graphicsState) {
        return text.getFontSize();
    }
    

    (CopyFormattedPageText method copyTextLikeNitishKumar)

    The main changes here are that I fixed the error (see above) and added the font to the target page resources. As you do not show how you fill the target page resources and match fonts with names, this was the easiest way to improvise. Beware: this is not a good way to do it, as it might lose quite a lot of font information.

    I then changed your addTextCharByChar to:

        private void addTextCharByChar(/*String string,*/ List<TextPositionsInfo> textinfoList, /*TextBBoxinfo textBBoxinfo,*/ PDPage page, PDPageContentStream currentContentStream) throws IOException {
            PDResources res = page.getResources();
    
            currentContentStream.beginText(); 
    //        if (textBBoxinfo._ElementType.toLowerCase().equals("h2")) {
                currentContentStream.beginMarkedContent(COSName.P);
                for(TextPositionsInfo textInfo : textinfoList) {
                    PDFont font = getFont(res, textInfo.fontName);
                    currentContentStream.setFont(font, /*textInfo.fontSize*/1);
                    Matrix _tm = textInfo.textMatrix;
                    currentContentStream.newLineAtOffset(_tm.getTranslateX(), _tm.getTranslateY());
                    currentContentStream.setTextMatrix(_tm);
                    currentContentStream.showText(textInfo.unicode);
                }
                currentContentStream.endMarkedContent();
    //            addContentToCurrentSection(COSName.P, StandardStructureTypes.H2);
    //        } else if (textBBoxinfo._ElementType.toLowerCase().equals("h1")) {
    //            beginMarkedConent(COSName.P);
    //            for(TextPositionsInfo textInfo : textinfoList) {
    //                PDFont font = getFont(res, textInfo.fontName);
    //                currentContentStream.setFont(font, textInfo.fontSize);
    //                currentContentStream.newLineAtOffset(textInfo.textMatrix.getTranslateX(), textInfo.textMatrix.getTranslateY());
    //                currentContentStream.setTextMatrix(textInfo.textMatrix);
    //                currentContentStream.showText(textInfo.unicode);
    //            }
    //            currentContentStream.endMarkedContent();
    //            addContentToCurrentSection(COSName.P, StandardStructureTypes.H1);
    //        }
            currentContentStream.endText();
        }
    
        // not provided by OP. Simple stub
        private PDFont getFont(PDResources res, String fontName) throws IOException {
            return res.getFont(COSName.getPDFName(fontName));
        }
    

    The main change here is dropping the code paths that depend on extra information you didn't share, like textBBoxinfo. Furthermore, I use 1 as the font size in the currentContentStream.setFont call because you set the text matrix to the original text rendering matrix right afterwards, and that matrix already contains the scaling by the font size.
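
The reasoning behind the font size of 1 can be checked with a small, library-free sketch. It assumes that the size at which a glyph is drawn is the font size given to setFont multiplied by the vertical scale of the text matrix; the class and method names are hypothetical, and the numbers (an original size of 12) are made up for illustration:

```java
final class EffectiveFontSize {
    /**
     * Vertical scale of an affine matrix [a b c d e f]: the length of its
     * second row (c, d). Using the length keeps this valid for rotated text.
     */
    static double verticalScale(double c, double d) {
        return Math.hypot(c, d);
    }

    /** Size the glyphs end up with: font size operand times matrix scale. */
    static double effectiveSize(double setFontSize, double c, double d) {
        return setFontSize * verticalScale(c, d);
    }

    public static void main(String[] args) {
        // Assumed stored rendering matrix for 12pt upright text: c = 0, d = 12.
        double c = 0, d = 12;

        // Passing the original size again scales the text twice.
        System.out.println(effectiveSize(12, c, d)); // 144.0
        // Passing 1 leaves the scaling entirely to the matrix.
        System.out.println(effectiveSize(1, c, d));  // 12.0
    }
}
```

The alternative would be to keep the real font size in setFont and store a matrix without the font size scaling; the code above takes the first route only because the stored matrix already includes it.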

    The result

    Running the code for this file (which looks very much like your screenshot) results in:

    [Screenshots: input page and output page]

    So it looks like the code you shared works after fixing the text rendering matrix and font size.

    Possible additional issues

    The output in your screenshot does not match the identified error, though. That error would usually print letters that are much too large, not a few letters at the correct size with the others missing.

    Thus, there appear to be other errors in the code you don't show. I can only guess their causes. My guesses would be:

    • The textBBoxinfo._ElementType is only "H1" or "H2" for a few letters, the ones you see in your output. As there is no code path for drawing text with other element type values, most of the letters aren't drawn at all.
    • Your font handling is erroneous and your getFont only returns a useful font for the letters you see in your output.
    • Your getActualFontSize returns a sensible size only for a few letters, the ones you see in the output. The other letters are too small or not drawn at all in the output.
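
A quick way to test the first guess is to count how many text blocks carry each element type before drawing anything. The following is only a sketch with hypothetical names: it takes the element types as plain strings (standing in for the textBBoxinfo._ElementType values, which I don't have) and reports how many of them would fall through the h1/h2 branches of addTextCharByChar as posted:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ElementTypeAudit {
    /** Counts blocks per lower-cased element type; null becomes "(none)". */
    static Map<String, Integer> countByType(List<String> elementTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String type : elementTypes) {
            String key = (type == null) ? "(none)" : type.toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Number of blocks that are neither h1 nor h2 and so would not be drawn. */
    static int countUnhandled(Map<String, Integer> counts) {
        int unhandled = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String key = entry.getKey();
            if (!key.equals("h1") && !key.equals("h2")) {
                unhandled += entry.getValue();
            }
        }
        return unhandled;
    }

    public static void main(String[] args) {
        // Made-up sample: one heading of each kind and three paragraphs.
        List<String> types = Arrays.asList("H1", "P", "P", "H2", "P");
        Map<String, Integer> counts = countByType(types);
        System.out.println(counts);                 // {h1=1, h2=1, p=3}
        System.out.println(countUnhandled(counts)); // 3
    }
}
```

If the second number is large for your document, most of the text never reaches a showText call, which would match the mostly empty output page.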