java ocr tesseract tess4j

How to detect text blocks and columns in a PDF with Tess4J


I'm new to Tesseract (Tess4J). I have managed to use the main features, such as reading text and getting word positions from both images and PDFs, rotating, etc.

I can't find a way to detect blocks of text (paragraphs or columns), and I'm not sure whether this can be done easily. Also, if the PDF contains other kinds of blocks, such as images, is it possible to extract them, or at least to get the position (bounding box) of each block?


Solution

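  • You can use the TessBaseAPIGetComponentImages API method, as follows:

    // TRUE and FALSE are the int constants defined in ITessAPI
    Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, TRUE, null, null);

    The returned Boxa holds one bounding box per block, in reading order. With the text_only argument set to TRUE, only text blocks are returned; pass FALSE to also get non-text blocks such as images. Use RIL_PARA instead of RIL_BLOCK to get paragraph-level boxes.

    Check the Tess4J unit tests for complete examples.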
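
    Tesseract reports blocks, not columns, so grouping blocks into columns is a post-processing step. The following is a minimal sketch of one possible heuristic, not part of Tess4J: it assumes each block box has already been copied into a java.awt.Rectangle (for example from the x, y, w and h fields of each Leptonica Box) and treats blocks whose horizontal ranges overlap as belonging to the same column. The class name ColumnGrouper, the method groupIntoColumns and the sample coordinates are hypothetical.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnGrouper {

    /** Groups block boxes into columns: boxes whose x-ranges overlap share a column. */
    public static List<List<Rectangle>> groupIntoColumns(List<Rectangle> blocks) {
        // Work on a copy sorted left to right so columns can be built in one pass.
        List<Rectangle> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator.comparingInt(r -> r.x));

        List<List<Rectangle>> columns = new ArrayList<>();
        int columnRight = Integer.MIN_VALUE; // right edge of the column being built
        for (Rectangle block : sorted) {
            if (columns.isEmpty() || block.x >= columnRight) {
                // No horizontal overlap with the current column: start a new one.
                columns.add(new ArrayList<>());
                columnRight = block.x + block.width;
            } else {
                // Overlaps the current column: widen the column if needed.
                columnRight = Math.max(columnRight, block.x + block.width);
            }
            columns.get(columns.size() - 1).add(block);
        }

        // Inside each column, order the blocks from top to bottom.
        for (List<Rectangle> column : columns) {
            column.sort(Comparator.comparingInt(r -> r.y));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical block boxes: two stacked blocks on the left, one on the right.
        List<Rectangle> blocks = List.of(
                new Rectangle(50, 400, 240, 300),
                new Rectangle(320, 100, 250, 600),
                new Rectangle(50, 100, 250, 280));

        List<List<Rectangle>> columns = groupIntoColumns(blocks);
        for (int i = 0; i < columns.size(); i++) {
            System.out.println("Column " + i + ": " + columns.get(i).size() + " block(s)");
        }
        // prints:
        // Column 0: 2 block(s)
        // Column 1: 1 block(s)
    }
}
```

    Note that a block spanning the full page width (a title, for instance) would merge every column it overlaps, so such blocks may need to be filtered out before grouping.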