PDFClown Different font-size in one line


I'm using PDF Clown to analyze a PDF document. In many documents, PDF Clown reports different heights for characters that obviously have the same height. Is there a workaround?

This is the code:

    while(_level.moveNext()) {
        ContentObject content = _level.getCurrent();
        if(content instanceof Text) {
            ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
            for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
                List<CharInfo> chars = new ArrayList<>();
                for(TextChar textChar : textString.getTextChars()) {
                    chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
                }
            }
        }
        else if(content instanceof XObject) {
            // Scan the external level
            if(((XObject)content).getScanner(_level)!=null){
                getContentLines(((XObject)content).getScanner(_level));
            }
        }
        else if(content instanceof ContainerObject){
            // Scan the inner level
            if(_level.getChildLevel()!=null){
                getContentLines(_level.getChildLevel());
            }
        }
    } 

Here is an example PDF document:

Example

In this document I marked two text chunks which both contain the word "million". When analyzing the size of each character in both occurrences of "million", I get the following:

  1. "m" in the first mark has height 14.50 and width 8.5
  2. "i" in the first mark has height 14.50 and width 3.0
  3. "l" in the first mark has height 14.50 and width 3.0
  4. "m" in the second mark has height 10.56 and width 6.255
  5. "i" in the second mark has height 10.56 and width 2.23
  6. "l" in the second mark has height 10.56 and width 2.23

Even though all characters of the two text chunks obviously have the same size, PDF Clown reports different sizes.


Solution

  • The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as

    begin-marked-content
    save-graphics-state
    restore-graphics-state
    end-marked-content
    

    or

    save-graphics-state
    begin-marked-content
    end-marked-content
    restore-graphics-state
    

    but never as

    save-graphics-state
    begin-marked-content
    restore-graphics-state
    end-marked-content
    

    or

    begin-marked-content
    save-graphics-state
    end-marked-content
    restore-graphics-state
    

    Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.
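
The four orderings listed above can be told apart with a simple stack check. The following is only a minimal sketch: it treats operators as plain strings, and the class and method names (`NestingCheck`, `isStrictlyNested`) are made up for illustration and are not part of PDF Clown.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper, not PDF Clown API: checks the nesting of q/Q and BDC/EMC. */
class NestingCheck {
    /**
     * Returns true only if every Q closes a q and every EMC closes a BDC/BMC
     * that is the innermost block still open, and nothing is left open.
     */
    static boolean isStrictlyNested(String... operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "q":
                case "BDC":
                case "BMC":
                    open.push(op);
                    break;
                case "Q":
                    // The innermost open block must be a graphics state block.
                    if (open.isEmpty() || !open.pop().equals("q")) return false;
                    break;
                case "EMC":
                    // The innermost open block must be a marked content section.
                    if (open.isEmpty() || open.pop().equals("q")) return false;
                    break;
                default:
                    break; // any other operator does not affect nesting
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isStrictlyNested("BDC", "q", "Q", "EMC")); // true
        System.out.println(isStrictlyNested("q", "BDC", "Q", "EMC")); // false
    }
}
```

A content stream for which this check returns false is one of the overlapping cases that the parser's assumption does not cover.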

    E.g. in the document at hand there are sequences like this:

    q
    [...1...]
    /P <</MCID 0 >>BDC 
    Q
    [...2...]
    EMC
    

    Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.

    Because of the wrong assumption and the order in which /P <</MCID 0 >>BDC and Q appear, however, PDF Clown parses the above as a single save/restore graphics state block that contains [...1...], an empty marked content block, and [...2...].

    Thus, if the graphics state is changed inside [...2...], PDF Clown assumes these changes are limited to the lines above, while they actually are not.
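
The misparse can be reproduced with a toy model. The sketch below is not PDF Clown code; it only imitates the general shape of a parser in which any end operator closes whatever container is currently open. Tokens are plain strings and all names (`NaiveNestingParser`, `ops1`, `ops2`) are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy model of a parser that closes the current container on ANY end token. */
class NaiveNestingParser {
    private final Iterator<String> tokens;

    NaiveNestingParser(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    /** Collects objects until the first end token, whichever kind it is. */
    List<String> parseObjects() {
        List<String> objects = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next();
            if (token.equals("Q") || token.equals("EMC")) {
                return objects; // does not check which container is being closed
            }
            objects.add(parseObject(token));
        }
        return objects;
    }

    /** Start tokens open a container whose content is parsed recursively. */
    private String parseObject(String token) {
        if (token.equals("q")) {
            return "GraphicsState" + parseObjects();
        }
        if (token.equals("BDC")) {
            return "MarkedContent" + parseObjects();
        }
        return token; // single operation
    }

    static String parse(String... tokens) {
        return new NaiveNestingParser(Arrays.asList(tokens)).parseObjects().toString();
    }

    public static void main(String[] args) {
        // Prints [GraphicsState[ops1, MarkedContent[], ops2]]
        System.out.println(parse("q", "ops1", "BDC", "Q", "ops2", "EMC"));
    }
}
```

In this model the `Q` closes the marked content block (leaving it empty) and the `EMC` closes the graphics state block, so `ops2` ends up inside the graphics state block, which matches the behaviour described above.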


    The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.

    To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:

    1. In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent check:

        public List<ContentObject> parseContentObjects(
          )
        {
          final List<ContentObject> contentObjects = new ArrayList<ContentObject>();
          while(moveNext())
          {
            ContentObject contentObject = parseContentObject();
            // Multiple-operation graphics object end?
            if(contentObject instanceof EndText // Text.
              || contentObject instanceof RestoreGraphicsState // Local graphics state.
             /* || contentObject instanceof EndMarkedContent // End marked-content sequence. */
              || contentObject instanceof EndInlineImage) // Inline image.
              return contentObjects;
      
            contentObjects.add(contentObject);
          }
          return contentObjects;
        }
      
    2. In parseContentObject() I commented out the if(operation instanceof BeginMarkedContent) branch:

        public ContentObject parseContentObject(
          )
        {
          final Operation operation = parseOperation();
          if(operation instanceof PaintXObject) // External object.
            return new XObject((PaintXObject)operation);
          else if(operation instanceof PaintShading) // Shading.
            return new Shading((PaintShading)operation);
          else if(operation instanceof BeginSubpath
            || operation instanceof DrawRectangle) // Path.
            return parsePath(operation);
          else if(operation instanceof BeginText) // Text.
            return new Text(
              parseContentObjects()
              );
          else if(operation instanceof SaveGraphicsState) // Local graphics state.
            return new LocalGraphicsState(
              parseContentObjects()
              );
       /*   else if(operation instanceof BeginMarkedContent) // Marked-content sequence.
            return new MarkedContent(
              (BeginMarkedContent)operation,
              parseContentObjects()
              );
       */   else if(operation instanceof BeginInlineImage) // Inline image.
            return parseInlineImage();
          else // Single operation.
            return operation;
        }
      

    With these changes in place, the character sizes are properly extracted.


    As an aside: while the returned individual character boxes seem to imply that each box is specific to the character in question, that is not true. Only the width of the box is character-specific; the height is calculated from overall font properties (and the current font size), not from the individual character, cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char):

      /**
        Gets the unscaled height of the given character.
    
        @param textChar
          Character whose height has to be calculated.
      */
      public final double getHeight(
        char textChar
        )
      {
        /*
          TODO: Calculate actual text height through glyph bounding box.
        */
        if(textHeight == -1)
        {textHeight = getAscent() - getDescent();}
        return textHeight;
      }
    

    Individual character height calculation is still a TODO.
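
A consequence of this is that, for one font at one size, every character gets the same box height. The sketch below illustrates the idea only; the metric values are made up, the scaling by font size is a simplifying assumption, and `UniformHeightSketch` is not a PDF Clown class.

```java
/** Illustration only: box height derived from font-wide metrics, not from the glyph. */
class UniformHeightSketch {
    // Hypothetical font metrics, expressed per unit of font size.
    static final double ASCENT = 0.905;
    static final double DESCENT = -0.212; // descent is negative (below the baseline)

    /** The character argument is deliberately unused: the height does not depend on it. */
    static double height(char textChar, double fontSize) {
        return (ASCENT - DESCENT) * fontSize;
    }

    public static void main(String[] args) {
        // 'm', 'i' and 'l' all yield the same height at a given size.
        System.out.println(height('m', 12.0) == height('i', 12.0));
        System.out.println(height('i', 12.0) == height('l', 12.0));
        // A different height therefore points to a different font or size
        // in the graphics state that was assumed for that text.
        System.out.println(height('m', 12.0) > height('m', 9.0));
    }
}
```

Under these assumptions, two different heights for the same letters (such as the 14.50 and 10.56 in the question) cannot come from the glyphs themselves; they indicate that different font parameters were in effect for the two chunks.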