pdfclown highlighting doesn't work for some pdf files


I am using the PdfClown library to highlight some text inside a PDF file, but for some files I get a NullPointerException when I run TextHighlightSample:

 [java] java.lang.NullPointerException
 [java]     at java.util.Hashtable.hash(Hashtable.java:239)
 [java]     at java.util.Hashtable.put(Hashtable.java:519)
 [java]     at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:139)
 [java]     at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
 [java]     at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
 [java]     at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
 [java]     at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
 [java]     at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
 [java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
 [java]     at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
 [java]     at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
 [java]     at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
 [java]     at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
 [java]     at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
 [java]     at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
 [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
 [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
 [java]     at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
 [java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
 [java]     at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
 [java]     at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
 [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
 [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
 [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
 [java]     at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
 [java]     at org.pdfclown.samples.cli.TextHighlightSample.run(TextHighlightSample.java:56)
 [java]     at org.pdfclown.samples.cli.SampleLoader.run(SampleLoader.java:140)
 [java]     at org.pdfclown.samples.cli.SampleLoader.main(SampleLoader.java:56)

Does anyone know how to solve this problem?


Solution

  • The foreground issue

    The foreground issue is that PdfClown, in SimpleFont.onLoad() (while reading the Widths from the font dictionary into its own structures), assumes that glyphIndexes has an entry for every codes value it looks up for the FirstChar-based indices of the Widths array:

      if(glyphWidthObjects != null)
      {
        ByteArray charCode = new ByteArray(
          new byte[]
          {(byte)((PdfInteger)getBaseDataObject().get(PdfName.FirstChar)).getIntValue()}
          );
        for(PdfDirectObject glyphWidthObject : glyphWidthObjects)
        {
          int glyphWidth = ((PdfNumber<?>)glyphWidthObject).getIntValue();
          if(glyphWidth > 0)
          {
            Integer code = codes.get(charCode);
            if(code != null)
            {
              glyphWidths.put(
                glyphIndexes.get(code),         //<<<<<<<<<<<<<<<<<<<<<<
                glyphWidth
                );
            }
          }
          charCode.data[0]++;
        }
      }
    

    If you check for null here, e.g. replacing

            if(code != null)
    

    by

            if(code != null && glyphIndexes.get(code) != null)
    

    you will get rid of the NullPointerException.
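
    To see why that guard is enough, here is a minimal stand-alone sketch of the failing put. It does not use PdfClown at all: the class and method names are hypothetical, and the maps are plain java.util.Hashtable instances standing in for glyphIndexes and glyphWidths (the stack trace shows that the failing call is Hashtable.put, which rejects null keys).

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical stand-in for the Widths loop in SimpleFont.onLoad; not PdfClown code.
final class NullKeyGuardDemo {

    /** Original behaviour: only code is checked, so a missing glyph index becomes a null key. */
    static void putUnguarded(Map<Integer, Integer> glyphWidths,
                             Map<Integer, Integer> glyphIndexes,
                             Integer code, int glyphWidth) {
        if (code != null) {
            // Hashtable.put throws NullPointerException when the key is null.
            glyphWidths.put(glyphIndexes.get(code), glyphWidth);
        }
    }

    /** Patched behaviour: the width is skipped when there is no glyph index for the code. */
    static boolean putGuarded(Map<Integer, Integer> glyphWidths,
                              Map<Integer, Integer> glyphIndexes,
                              Integer code, int glyphWidth) {
        if (code == null) {
            return false;
        }
        Integer glyphIndex = glyphIndexes.get(code);
        if (glyphIndex == null) {
            return false;
        }
        glyphWidths.put(glyphIndex, glyphWidth);
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> glyphIndexes = new Hashtable<>();
        glyphIndexes.put(65, 3); // made-up entry: code 65 -> glyph 3
        Map<Integer, Integer> glyphWidths = new Hashtable<>();

        System.out.println(putGuarded(glyphWidths, glyphIndexes, 65, 600));   // entry exists
        System.out.println(putGuarded(glyphWidths, glyphIndexes, 8217, 250)); // no entry, skipped
        try {
            putUnguarded(glyphWidths, glyphIndexes, 8217, 250);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from Hashtable.put");
        }
    }
}
```

    Note that the guard only suppresses the exception; the width for the affected code is silently dropped, so measurements for those glyphs may still be off.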

    Usually there are glyphIndexes entries for all those values, so usually you don't get the NullPointerException here. But PdfClown, in its attempt to extract as much as possible, uses a mixture of information from the PDF objects and the embedded font objects, and there still seem to be some shortcomings in how that information is coordinated, as in the case of your document:

  • The background issue

    While constructing a TrueTypeFont object for the font SourceSansPro-Regular, PdfClown

    • (Font.load) tries to read a ToUnicode map to get a mapping from character codes to Unicode and put it into codes; unfortunately the font has no ToUnicode map, so codes remains null;
    • (OpenFontParser construction in TrueTypeFont.loadEncoding, initially called by SimpleFont.onLoad) tries to read information from the embedded font file; among other data, it retrieves a mapping 32..213 -> 0..44 from character codes to in-font glyph indices;
    • (still in TrueTypeFont.loadEncoding) sets the font object's glyphIndexes member to that map; if a codes mapping already existed at this point, it would be used to turn the map into a mapping Unicode -> 0..44; but codes is null (see above), so glyphIndexes remains as is;
    • (still in TrueTypeFont.loadEncoding) as there is no codes mapping yet, creates one based on the MacRomanEncoding entry from the PDF font dictionary;
    • (still in TrueTypeFont.loadEncoding) if there were no glyphIndexes yet, would derive one from the current codes mapping and the Widths array; but there already is one, so it remains as is;
    • (SimpleFont.onLoad) finally tries to put the contents of the PDF font dictionary's Widths array into its glyphWidths map. The code (see above) assumes that glyphIndexes is keyed by Unicode values and, therefore, translates the character codes using codes first. Unfortunately, glyphIndexes here is keyed not by Unicode values but by character codes. Thus the failure observed above occurs.
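
    The key-space mismatch described in the steps above can be reproduced with plain maps. This is only a sketch under assumptions: the class and method names are hypothetical, the glyph index values for codes 32 and 213 are assumed to be the endpoints 0 and 44 of the mapping mentioned above, and the codes excerpt uses two MacRoman entries (0x20 is U+0020, 0xD5 is U+2019).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the codes / glyphIndexes interplay; not PdfClown code.
final class KeySpaceMismatchDemo {

    /** Excerpt of a character code -> Unicode map as MacRomanEncoding would produce it. */
    static Map<Integer, Integer> macRomanExcerpt() {
        Map<Integer, Integer> codes = new HashMap<>();
        codes.put(0x20, 0x0020); // space: same value in both key spaces
        codes.put(0xD5, 0x2019); // right single quotation mark: values differ
        return codes;
    }

    /** Glyph indices as read from the embedded font: keyed by character code (assumed values). */
    static Map<Integer, Integer> embeddedGlyphIndexes() {
        Map<Integer, Integer> glyphIndexes = new HashMap<>();
        glyphIndexes.put(32, 0);
        glyphIndexes.put(213, 44);
        return glyphIndexes;
    }

    /** Lookup as done in the Widths loop: character code -> codes -> glyphIndexes. */
    static Integer lookup(Map<Integer, Integer> codes,
                          Map<Integer, Integer> glyphIndexes,
                          int charCode) {
        Integer unicode = codes.get(charCode);
        return unicode == null ? null : glyphIndexes.get(unicode);
    }

    /** What would happen if codes were available early: re-key glyphIndexes by Unicode. */
    static Map<Integer, Integer> remapToUnicode(Map<Integer, Integer> codes,
                                                Map<Integer, Integer> glyphIndexes) {
        Map<Integer, Integer> byUnicode = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : codes.entrySet()) {
            Integer glyphIndex = glyphIndexes.get(entry.getKey());
            if (glyphIndex != null) {
                byUnicode.put(entry.getValue(), glyphIndex);
            }
        }
        return byUnicode;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codes = macRomanExcerpt();
        Map<Integer, Integer> byCharCode = embeddedGlyphIndexes();

        // Works by coincidence: code 0x20 and U+0020 are the same number.
        System.out.println(lookup(codes, byCharCode, 0x20));
        // Fails: U+2019 is not a key in a map keyed by character codes.
        System.out.println(lookup(codes, byCharCode, 0xD5));
        // With consistent key spaces the same lookup succeeds.
        System.out.println(lookup(codes, remapToUnicode(codes, byCharCode), 0xD5));
    }
}
```

    In this model the lookup only breaks for codes whose Unicode value differs from the character code, which is consistent with the exception showing up for some fonts and documents but not for others.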

    Font extraction in PdfClown 0.1.3 is in need of clean-up. It tries to make use of information from both the PDF objects and the embedded fonts (which is a good idea), but in some situations, as here, it shoots itself in the foot.

    But it's still an early 0.x version after all, so some issues are to be expected...