I am developing an application that will be used to automate invoice indexing. One use case of my application is to extract tables from scanned documents. To do this, I need to extract the coordinates of all the words in the text (if that is not possible, I could use the coordinates of the individual letters instead). I plan on using Tesseract 4.0 for C# and I wanted to know if this is possible.
Thank you
Yes, you can get the bounding box for each recognized word. Below is sample code using the C# Tesseract wrapper.
// Requires the Tesseract wrapper namespace: using Tesseract;
// Initialize the TesseractEngine
using (var engine = new TesseractEngine("path to tessdata folder", "eng", EngineMode.Default))
{
    // image is the Bitmap on which OCR is to be performed
    using (var page = engine.Process(image, PageSegMode.Auto))
    {
        using (var iterator = page.GetIterator())
        {
            iterator.Begin();
            do
            {
                string currentWord = iterator.GetText(PageIteratorLevel.Word);
                // TryGetBoundingBox returns false when no box is available for the current word
                if (iterator.TryGetBoundingBox(PageIteratorLevel.Word, out Rect bounds))
                {
                    // do something with currentWord and bounds
                }
            }
            while (iterator.Next(PageIteratorLevel.Word));
        }
    }
}
You can now store the bounds for each word and write your own logic to map the words to table rows and columns based on their bounding boxes. This is the difficult part, but if your table format is neat, you should be able to get it working with some effort. Also, consider looking at the Tabula library to see if it can solve the problem at hand.
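As a rough illustration of that mapping step, here is a minimal sketch that groups stored word boxes into rows by comparing their vertical centres, then sorts each row left to right. WordBox, TableRowGrouper, GroupIntoRows and the tolerance parameter are hypothetical names made up for this example; they are not part of the Tesseract wrapper. It assumes the scan is not skewed and that rows do not overlap vertically, so treat it as a starting point rather than a finished table detector.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR word and its box (not a Tesseract type).
public sealed class WordBox
{
    public string Text { get; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }

    public WordBox(string text, int x1, int y1, int x2, int y2)
    {
        Text = text;
        X1 = x1;
        Y1 = y1;
        X2 = x2;
        Y2 = y2;
    }

    // Vertical centre of the box, used to decide which row the word belongs to.
    public double CenterY => (Y1 + Y2) / 2.0;
}

public static class TableRowGrouper
{
    // Walks the words from top to bottom. A word joins the current row when its
    // vertical centre is within `tolerance` pixels of the centre of the word that
    // started the row; otherwise it starts a new row.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, double tolerance)
    {
        var rows = new List<List<WordBox>>();
        double rowAnchorY = 0;

        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            if (rows.Count == 0 || Math.Abs(word.CenterY - rowAnchorY) > tolerance)
            {
                rows.Add(new List<WordBox>());
                rowAnchorY = word.CenterY;
            }
            rows[rows.Count - 1].Add(word);
        }

        // Order the words inside each row from left to right.
        foreach (var row in rows)
        {
            row.Sort((a, b) => a.X1.CompareTo(b.X1));
        }
        return rows;
    }
}
```

Inside the OCR loop you would fill the list with something like words.Add(new WordBox(currentWord, bounds.X1, bounds.Y1, bounds.X2, bounds.Y2)), assuming the wrapper's Rect exposes X1, Y1, X2 and Y2, and then call TableRowGrouper.GroupIntoRows(words, tolerance) with a tolerance of roughly half the typical text height. Columns can be estimated the same way by clustering on the horizontal position instead.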