I am learning Tesseract OCR and reading an article that is based on another article. From the first article:
First step is Adaptive Thresholding, which converts the image into binary images. Next step is connected component analysis which is used to extract character outlines. This method is very useful because it does the OCR of image with white text and black background. Tesseract was probably first to provide this kind of processing. Then after, the outlines are converted into Blobs. Blobs are organized into text lines, and the lines and regions are analyzed for some fixed area or equivalent text size.
Could anyone explain what a blob is?
From https://tesseract-ocr.repairfaq.org/tess_glossary.html :
Blob
Isolated, small region of the scanned image. It's delineated by the outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improves the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.
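To make the definition concrete, here is a minimal sketch of how a blob can be found as a connected region of dark pixels. This is not Tesseract's code: the function name `find_blobs`, the fixed global threshold (standing in for adaptive thresholding), and the choice of 8-connectivity are all assumptions made for illustration, and a bounding box stands in for the outline.

```python
from collections import deque

def find_blobs(image, threshold=128):
    """Return bounding boxes (top, left, bottom, right) of dark connected regions.

    Hypothetical helper for illustration only; not part of Tesseract.
    """
    rows, cols = len(image), len(image[0])
    # Simplification: one global threshold instead of adaptive thresholding.
    binary = [[pixel < threshold for pixel in row] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                # Visit the 8 neighbours (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append((top, left, bottom, right))
    return blobs

W, B = 255, 0  # white background, black ink
page = [
    [W, W, W, W, W, W, W],
    [W, B, W, W, B, B, W],
    [W, B, W, W, B, W, W],
    [W, B, W, W, B, B, W],
    [W, W, W, W, W, W, W],
]
print(find_blobs(page))  # [(1, 1, 3, 1), (1, 4, 3, 5)]
```

The tiny page holds two shapes, roughly an "I" and a "C", and each comes back as one blob. Splitting or combining blobs, as the glossary describes, would be a later step operating on regions like these.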