tesseract, tess4j

OCR PDF to text


I am using Tess4J to OCR a PDF and extract its text. It works well (although it takes a long time), but it doesn't detect columns: it prints lines from two columns joined together. However, if I convert the PDF to TIFF using "convert" and then run tesseract directly on the TIFF file from the command line, it generates the text column by column. Any idea how to make this work in Tess4J or javacpp from Java?

Here is my code for Tess4J:

public static void main(String[] args)
{

    // sets the properties file for log4j
    org.apache.log4j.PropertyConfigurator.configure("C://Projects//Library//Tess4J//log4j.properties.txt");

    File image = new File("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//Arlen Effect Abstract.pdf");
    // recognizeTextBlocks(image.toPath());

    Tesseract tessInst = new Tesseract();
    tessInst.setDatapath("C://Projects//Library//Tess4J");
    try
    {
        String result = tessInst.doOCR(image);
        System.out.println(result);
    }
    catch (TesseractException e)
    {
        System.err.println(e.getMessage());
    }

}

Here is my code for javacpp:

public static void main(String[] args)
{
    BytePointer outText;

    TessBaseAPI api = new TessBaseAPI();
    // Initialize tesseract-ocr with English, using an explicit tessdata path
    if (api.Init("C:\\Projects\\Library\\Tess4J\\tessdata", "eng") != 0)
    {
        System.err.println("Could not initialize tesseract.");
        System.exit(1);
    }

    // Open input image with leptonica library
    PIX image = pixRead(args.length > 0 ? args[0] : "C://Users//arpit.tandon//Documents//My Received Files//SomePapers//out.tiff");
    api.SetImage(image);
    // Get OCR result
    outText = api.GetUTF8Text();
    System.out.println("OCR output:\n" + outText.getString());

    // Destroy used object and release memory
    api.End();
    outText.deallocate();
    pixDestroy(image);
}

Solution

  • I found the answer: I had to call tessInst.setPageSegMode(3); before running the OCR. The command-line help for tesseract lists which number to use for which purpose.
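
To illustrate the number-to-purpose mapping mentioned above, here is a minimal sketch of a lookup table for page segmentation modes. The class PageSegModes and its describe method are hypothetical helpers, not part of Tess4J or Tesseract. The descriptions are paraphrased from the tesseract command-line help as I recall it for modes 0 to 10; newer releases list additional modes, so check the output of your own installation before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tess4J): maps page segmentation mode
// numbers to the purposes listed in the tesseract command-line help.
// Assumption: descriptions are paraphrased from memory; verify locally.
class PageSegModes {
    private static final Map<Integer, String> MODES = new LinkedHashMap<>();

    static {
        MODES.put(0, "Orientation and script detection (OSD) only");
        MODES.put(1, "Automatic page segmentation with OSD");
        MODES.put(2, "Automatic page segmentation, but no OSD or OCR");
        MODES.put(3, "Fully automatic page segmentation, but no OSD");
        MODES.put(4, "Assume a single column of text of variable sizes");
        MODES.put(5, "Assume a single uniform block of vertically aligned text");
        MODES.put(6, "Assume a single uniform block of text");
        MODES.put(7, "Treat the image as a single text line");
        MODES.put(8, "Treat the image as a single word");
        MODES.put(9, "Treat the image as a single word in a circle");
        MODES.put(10, "Treat the image as a single character");
    }

    // Returns the description for a mode number, or throws if it is not in the table.
    static String describe(int mode) {
        String description = MODES.get(mode);
        if (description == null) {
            throw new IllegalArgumentException("Unknown page segmentation mode: " + mode);
        }
        return description;
    }

    public static void main(String[] args) {
        // Print the whole table, one mode per line.
        for (Map.Entry<Integer, String> entry : MODES.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Mode 3 is the value passed to tessInst.setPageSegMode in the answer. It lets Tesseract analyse the page layout itself, which is what allowed the two columns to be read separately here.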