What is the difference between these two font names?


When I print TextRenderInfo.getFont().getPostscriptFontName() for two PDF files, I get AAAAAD+SourceHanSansCN-Normal for one and BIISMY+SourceHanSansCN-Normal for the other.

I know that SourceHanSansCN-Normal follows the FontName-FontScript format, but what is AAAAAD? It does not look like a font family name.

Example Code:

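import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class CheckPdfAllFontTest implements TextExtractionStrategy {

    public static final String SRC = "ownTestFile.pdf";

    @Override
    public String getResultantText() {
        // This strategy only prints font names; it does not accumulate text.
        return null;
    }

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo textRenderInfo) {
        // Print each text chunk together with the PostScript name of its font.
        String fontName = textRenderInfo.getFont().getPostscriptFontName();
        String text = textRenderInfo.getText();
        System.out.println(text + "=====" + fontName);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo imageRenderInfo) {
    }

    public static void main(String[] args) throws IOException {
        new CheckPdfAllFontTest().parse(SRC);
    }

    public void parse(String filename) throws IOException {
        int pageNumber = 1;
        PdfReader reader = new PdfReader(filename);
        try {
            // The return value is getResultantText(), which is null here, so it is ignored.
            PdfTextExtractor.getTextFromPage(reader, pageNumber, this);
        } finally {
            reader.close();
        }
    }
}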

iText version:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.8</version>
</dependency>

The two PDFs were exported from a PowerPoint file, one with the "embed font" setting and one with the "not embed font" setting.

  • "AAAAAD+SourceHanSansCN-Normal" comes from the "embed font" PDF file.
  • "BIISMY+SourceHanSansCN-Normal" comes from the "not embed font" PDF file.

I am collecting the fonts used in a PDF, and I found font names in this format. I don't know what the part before the '+' is. How is it defined?


Solution

  • According to the PDF specification:

    9.9.2 Font subsets

    PDF documents may include subsets of PDF fonts whose Subtype is Type1, TrueType or OpenType. The font and font descriptor that describe a font subset are slightly different from those of ordinary fonts. These differences allow a PDF processor to recognise font subsets and to merge documents containing different subsets of the same font. (For more information on font descriptors, see 9.8, "Font descriptors".)

    For a font subset, the PostScript name of the font, that is, the value of the font’s BaseFont entry and the font descriptor’s FontName entry, shall begin with a tag followed by a plus sign (+) followed by the PostScript name of the font from which the subset was created. The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets of the same font in the same PDF file shall have different tags. The glyph name .notdef shall be defined in the font subset.

    NOTE It is recommended that PDF processors treat multiple subset fonts as completely independent entities, even if they appear to have been created from the same original font.

    EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font.

    (ISO 32000-2)

    Thus, AAAAAD+SourceHanSansCN-Normal and BIISMY+SourceHanSansCN-Normal are most likely different subsets of the same source font. The six letters before the '+' are an arbitrary subset tag, not part of the font's own name.
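
If the goal is to collect the fonts used in a PDF, one option is to strip the subset tag before comparing names. The following is a minimal sketch rather than part of iText: the class SubsetFontNames and its methods isSubset and baseName are hypothetical helpers, and it assumes that any name starting with six uppercase ASCII letters and a '+' is a subset name as described in 9.9.2.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper (not an iText class) for handling subset-tagged font names.
final class SubsetFontNames {

    // Six uppercase ASCII letters followed by '+', per the rule quoted from 9.9.2.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // True if the name starts with a subset tag such as "AAAAAD+".
    static boolean isSubset(String postscriptName) {
        return SUBSET_TAG.matcher(postscriptName).find();
    }

    // Returns the name without the subset tag; untagged names are returned unchanged.
    static String baseName(String postscriptName) {
        return SUBSET_TAG.matcher(postscriptName).replaceFirst("");
    }

    public static void main(String[] args) {
        // Names as they might be reported by getPostscriptFontName().
        String[] reported = {
            "AAAAAD+SourceHanSansCN-Normal",
            "BIISMY+SourceHanSansCN-Normal",
            "Helvetica"
        };

        // Collect distinct base names in first-seen order.
        Set<String> fonts = new LinkedHashSet<>();
        for (String name : reported) {
            fonts.add(baseName(name));
        }
        System.out.println(fonts); // prints [SourceHanSansCN-Normal, Helvetica]
    }
}
```

In the question's renderText method, this would amount to calling baseName on the value returned by getPostscriptFontName(). Note that stripping the tag only groups names for reporting; per the NOTE in the specification, the subsets themselves should still be treated as independent fonts.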