java ocr tesseract tess4j color-depth

Reducing a Tesseract-produced searchable PDF from 8-bit color depth back to 1-bit (tess4j)


I have PDFs with 1-bit color depth, approx. 30 KB each, as input for OCR processing (tess4j 5.0.0). After processing, each PDF is 120-130 KB and is saved with 8-bit color depth, which is probably the main cause of the file size increase.
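
As a rough sanity check of that guess, here is a minimal sketch (standard library only) that compares the uncompressed pixel data of a 1-bit and an 8-bit image of the same dimensions. The class name `BitDepthSketch`, the helper `rawBytes`, and the A4-at-300-DPI page size are assumptions for illustration. PDFs store images compressed, so real file sizes will differ from these raw numbers.

```java
import java.awt.image.BufferedImage;

final class BitDepthSketch {
    // Uncompressed size in bytes of the pixel data, based on the image's bits per pixel.
    static long rawBytes(BufferedImage img) {
        int bitsPerPixel = img.getColorModel().getPixelSize();
        long bitsPerRow = (long) img.getWidth() * bitsPerPixel;
        long bytesPerRow = (bitsPerRow + 7) / 8; // each row is padded to whole bytes
        return bytesPerRow * img.getHeight();
    }

    public static void main(String[] args) {
        int w = 2480, h = 3508; // assumed page size: A4 at 300 DPI
        BufferedImage oneBit = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        BufferedImage eightBit = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println("1-bit raw size: " + rawBytes(oneBit) + " bytes");
        System.out.println("8-bit raw size: " + rawBytes(eightBit) + " bytes");
    }
}
```

Before compression the 8-bit page carries eight times as much pixel data as the 1-bit page, which is consistent with the output growing once the scan is stored at 8 bits.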

I would like to know whether there is a way to set the color depth within Tesseract or its associated libraries, or whether there is another way to handle this.

import java.util.*;
import net.sourceforge.tess4j.*;

ITesseract instance = new Tesseract();
instance.setDatapath("/path/to/tessdata");
instance.setPageSegMode(ITessAPI.TessPageSegMode.PSM_SINGLE_COLUMN);
// Ask for a regular searchable PDF (page image plus text layer)
List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF));
instance.createDocumentsWithResults(inputPdf.getPath(), "/path/to/result", formats, ITessAPI.TessPageIteratorLevel.RIL_WORD);

Any help is greatly appreciated.


Solution

  • Eventually, I came up with a workaround: you can specify the output by choosing a RenderedFormat. I changed it from PDF to PDF_TEXTONLY, which produced a PDF (~7 KB) with the text in the right position but without the original scan/image.

    List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF_TEXTONLY));
    

    Then I used PDFBox to render each page of the original PDF to an image. The DPI can be specified, which also helps to reduce the file size.

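    import java.awt.image.BufferedImage;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.ImageType;
    import org.apache.pdfbox.rendering.PDFRenderer;
    import org.apache.pdfbox.tools.imageio.ImageIOUtil;

    try (PDDocument document = PDDocument.load(inputPdf)) { // closed even if rendering fails
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        for (int page = 0; page < document.getNumberOfPages(); ++page) {
            // ImageType.BINARY renders the page as a 1-bit black-and-white image
            BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY);
            ImageIOUtil.writeImage(bim, "/path/to/pics/picture_" + page + ".png", 300);
        }
    }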
    

    Finally, add each image to the text-only PDF as a watermark, placed underneath the text (see the question "How to insert a image under the text as a pdf background using iText?"). This reduced the size from 120-130 KB to 60 KB at 300 DPI (and even less at a lower DPI), which is a good result for an OCR-processed PDF whose original was 30 KB. I know this is not the best solution, and I'd welcome any other contribution or answer.
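
The answer does the watermark step with iText and does not show code for it. As an alternative, here is a minimal sketch that does the same step with PDFBox 2.x alone, so the intermediate PNG files are not needed. The class name `ScanUnderText` and the method `merge` are hypothetical. The sketch assumes both PDFs have the same page order and that each text-only page has the same proportions as its scan. It does not handle rotated pages.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

final class ScanUnderText {
    // Renders each page of scanPdf as a 1-bit image and draws it beneath
    // the content of the matching page in textOnlyPdf, then saves to out.
    static void merge(File scanPdf, File textOnlyPdf, File out, int dpi) throws IOException {
        try (PDDocument scan = PDDocument.load(scanPdf);
             PDDocument text = PDDocument.load(textOnlyPdf)) {
            PDFRenderer renderer = new PDFRenderer(scan);
            int pages = Math.min(scan.getNumberOfPages(), text.getNumberOfPages());
            for (int i = 0; i < pages; i++) {
                // 1-bit black-and-white rendering of the original page
                BufferedImage bim = renderer.renderImageWithDPI(i, dpi, ImageType.BINARY);
                PDImageXObject img = LosslessFactory.createFromImage(text, bim);
                PDPage page = text.getPage(i);
                PDRectangle box = page.getMediaBox();
                // PREPEND puts the image before the existing content, i.e. under the text
                try (PDPageContentStream cs =
                         new PDPageContentStream(text, page, AppendMode.PREPEND, true)) {
                    cs.drawImage(img, box.getLowerLeftX(), box.getLowerLeftY(),
                                 box.getWidth(), box.getHeight());
                }
            }
            text.save(out);
        }
    }
}
```

A call such as `ScanUnderText.merge(inputPdf, new File("/path/to/result.pdf"), new File("/path/to/merged.pdf"), 300)` would then produce the combined file. The `/path/to/merged.pdf` output name is a placeholder, and the resulting size depends on the DPI and on how the 1-bit image ends up compressed.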