Search code examples
visual-studio-2010visual-c++ocrtesseract

How to extract multiple lines from an image using Tesseract OCR?


  1. We have passed an image with single line having the text "Hello World" and the Tesseract OCR perfectly show the result 'Hello World'.

  2. But when we passed an image with multiple lines text

Hello world
How are you

it doesn't show anything.

Here is our codes:

#include "stdafx.h"
#include <iostream>
#include <baseapi.h>
#include <allheaders.h>
#include <fstream>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    tesseract::TessBaseAPI api;

    api.Init("", "eng", tesseract::OEM_DEFAULT);
    api.SetPageSegMode(static_cast<tesseract::PageSegMode>(7));
    api.SetOutputName("out");

    cout<<"File name:";
    char image[256];
    cin>>image;
    PIX   *pixs = pixRead(image);

    STRING text_out;
    api.ProcessPages(image, NULL, 0, &text_out);

    cout<<text_out.string();

    ofstream files;
    files.open("out.txt");
    files << text_out.string()<<endl;
    files.close();

    cin>> image;
    return 0;
}

input with 1 line

output with 1 line

input with 2 lines

output with 2 lines


Solution

  • Page Segmentation Mode 7 treats the image as a single text line. Try 3, which is Fully automatic page segmentation, but no OSD (default).