I am using 'cz.adaptech.tesseract4android:tesseract4android:4.3.0'
in my Android project.
Is it possible to get each word's bounding box together with its text, as in the example below?
(32, 24, 60, 17) Maitre
(100, 24, 82, 19) corbeau,
(191, 28, 29, 13) sur
(227, 28, 22, 12) un
(257, 24, 50, 17) arbre
(315, 24, 70, 21) perché,
(79, 49, 58, 17) Tenait
The official sample only shows how to get the plain text, not the boxes with the text inside them:
TessBaseAPI tess = getTessBaseAPI(path, context);
String text = tess.getUTF8Text();
To get the bounding boxes together with their text, use the following code:
TessBaseAPI tess = new TessBaseAPI();
// The given path must contain a `tessdata` subdirectory that holds the `*.traineddata` language files
String dataPath = context.getExternalFilesDir(null).getPath() + "/OCRme/";
// Initialize API for specified language (can be called multiple times during Tesseract lifetime)
if (!tess.init(dataPath, "eng", TessBaseAPI.OEM_TESSERACT_LSTM_COMBINED)) {
    throw new IOException("Error initializing Tesseract (wrong data path or language)");
}
// Specify image and then recognize it and get result (can be called multiple times during Tesseract lifetime)
tess.setImage(bitmap);
tess.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO_OSD);
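// Run recognition; the returned string is ignored here, the call is only needed to produce results
tess.getUTF8Text();
ResultIterator resultIterator = tess.getResultIterator();
List<Rect> boxes = new ArrayList<>();
List<String> texts = new ArrayList<>();
int level = TessBaseAPI.PageIteratorLevel.RIL_WORD;
// Start at the first word; calling next() before reading would skip it
resultIterator.begin();
do {
    Rect rect = resultIterator.getBoundingRect(level);
    String text = resultIterator.getUTF8Text(level);
    boxes.add(rect);
    texts.add(text);
} while (resultIterator.next(level));
// Release the native iterator once it is no longer needed
resultIterator.delete();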
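To print the results in the same form as the example in the question, each box can be turned into a line of text. The following is only a minimal sketch: it assumes the tuples in the question mean (x, y, width, height), and `BoxFormatter` and `formatBox` are hypothetical names, not part of tesseract4android. The method takes the four edges as plain ints so that it runs without any Android classes.

```java
final class BoxFormatter {
    // Hypothetical helper: formats one word as "(x, y, width, height) text".
    // Assumes left/top/right/bottom are the pixel edges of the word's bounding rectangle.
    static String formatBox(int left, int top, int right, int bottom, String text) {
        int width = right - left;
        int height = bottom - top;
        return "(" + left + ", " + top + ", " + width + ", " + height + ") " + text;
    }

    public static void main(String[] args) {
        // Edges chosen so that width = 60 and height = 17
        System.out.println(formatBox(32, 24, 92, 41, "Maitre")); // prints "(32, 24, 60, 17) Maitre"
    }
}
```

On Android, this could be called for each index `i` of the two lists with `boxes.get(i).left`, `boxes.get(i).top`, `boxes.get(i).right`, `boxes.get(i).bottom` and `texts.get(i)`.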