Coordinate extraction using Tesseract 4.0


I am developing an application that will be used to automate invoice indexing. One use-case of my application is to extract tables from scanned documents. To do this, I need to extract the coordinates of all the words in the text (if this is not possible, I could use the coordinates of the letters as well). I plan on using Tesseract 4.0 for C# and I wanted to know if this is possible.

Thank you


Solution

  • You can get the bounding box for each recognized word. Below is sample code using the C# Tesseract wrapper.

      // Requires: using Tesseract;
      // Initialize the TesseractEngine
      using (var engine = new TesseractEngine("path to tessdata folder", "eng", EngineMode.Default))
      {
          // image here is the Bitmap on which OCR is to be performed
          using (var page = engine.Process(image, PageSegMode.Auto))
          {
              using (var iterator = page.GetIterator())
              {
                  iterator.Begin();
                  do
                  {
                      string currentWord = iterator.GetText(PageIteratorLevel.Word);
                      // TryGetBoundingBox returns false when no box is available
                      if (iterator.TryGetBoundingBox(PageIteratorLevel.Word, out Rect bounds))
                      {
                          // do something with currentWord and bounds
                      }
                  }
                  while (iterator.Next(PageIteratorLevel.Word));
              }
          }
      }
    

    You can now store the bounds for each word and write your own logic to map the words to table rows and columns based on their bounding boxes. This is the difficult part, but if your table format is neat, you should be able to get it working with some effort. Also consider looking at the Tabula library to see whether it solves the problem at hand.
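
    As a starting point for that mapping step, here is a minimal sketch of one possible approach: group words into rows by the vertical center of their boxes, then order each row from left to right. The `WordBox` class, the `TableRows.GroupIntoRows` helper, and the `tolerance` parameter are hypothetical names introduced for illustration; they are not part of the Tesseract wrapper. The sketch assumes the scan is not skewed and that rows do not overlap vertically, and splitting each row into columns is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for one OCR word and its box (not a Tesseract type).
public sealed class WordBox
{
    public string Text { get; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }

    public WordBox(string text, int x1, int y1, int x2, int y2)
    {
        Text = text; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    // Vertical midpoint of the box, used to decide which row a word belongs to.
    public double CenterY => (Y1 + Y2) / 2.0;
}

public static class TableRows
{
    // Groups words into rows: a word joins the current row when its vertical
    // center is within `tolerance` pixels of the center of that row's first word.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, double tolerance)
    {
        var rows = new List<List<WordBox>>();
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(word.CenterY - last[0].CenterY) <= tolerance)
                last.Add(word);
            else
                rows.Add(new List<WordBox> { word });
        }
        // Order the words of each row from left to right.
        return rows.Select(r => r.OrderBy(w => w.X1).ToList()).ToList();
    }
}
```

    Inside the OCR loop you would add one `WordBox` per word, built from `currentWord` and the coordinates of `bounds`, and call `GroupIntoRows` once the loop finishes. A tolerance of roughly half the typical text height is a reasonable first guess, but it needs tuning for each document layout.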