I want to do text segmentation on a printed document. I have already segmented the document down to the character level, but my approach fails when it meets touching characters. I want to use Tesseract OCR only to segment the words. I know Tesseract can do this, but I don't know how to access that functionality without digging into Tesseract's internal code. Can anyone give me some advice? If possible, I need this in Python.
If you can call the TessBaseAPIGetComponentImages API method, you can retrieve the segmentation at various pageIteratorLevel levels (Symbol/Character, Word, Line, etc.) without performing actual OCR on the image.
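
Since you asked for Python, here is a minimal sketch of that idea. It assumes the third-party tesserocr binding (plus Pillow), which exposes the same call as PyTessBaseAPI.GetComponentImages; other wrappers may not offer it. The names box_to_rect and segment_words are made up for illustration.

```python
def box_to_rect(box):
    """Convert a tesserocr box dict (keys x, y, w, h) to (left, top, right, bottom)."""
    return (box["x"], box["y"], box["x"] + box["w"], box["y"] + box["h"])


def segment_words(image_path):
    """Return word bounding boxes for the image without running recognition.

    Assumes tesserocr and Pillow are installed; they are imported here so
    that box_to_rect stays usable without them.
    """
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    with PyTessBaseAPI() as api:
        api.SetImage(Image.open(image_path))
        # RIL.WORD asks for word-level components; use RIL.SYMBOL for
        # characters or RIL.TEXTLINE for lines. True means text_only.
        components = api.GetComponentImages(RIL.WORD, True)
        # Each component is (PIL image, box dict, block id, paragraph id).
        return [box_to_rect(box) for _image, box, _block, _para in components]
```

Calling segment_words("page.png") (a placeholder path) should give one rectangle per detected word, which you can crop from the page and pass to your own character segmentation.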