I am using Tess4J to extract text from a PDF via OCR. It works well (though it takes a long time), but it doesn't detect columns and prints lines from two columns together. However, if I convert the PDF to TIFF using "convert" and then run tesseract directly on the TIFF file from the command line, it generates the text column by column. Any idea how to make this work in Tess4J or javacpp using Java?
The following is my Tess4J code:
public static void main(String[] args)
{
org.apache.log4j.PropertyConfigurator.configure("C://Projects//Library//Tess4J//log4j.properties.txt"); // sets the properties file for log4j
File image = new File("C://Users//arpit.tandon//Documents//My Received Files//SomePapers//Arlen Effect Abstract.pdf");
// recognizeTextBlocks(image.toPath());
Tesseract tessInst = new Tesseract();
tessInst.setDatapath("C://Projects//Library//Tess4J");
try
{
String result = tessInst.doOCR(image);
System.out.println(result);
}
catch (TesseractException e)
{
System.err.println(e.getMessage());
}
}
The following is my javacpp code:
public static void main(String[] args)
{
BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
// Initialize tesseract-ocr with English, using an explicit tessdata path
if (api.Init("C:\\Projects\\Library\\Tess4J\\tessdata", "eng") != 0)
{
System.err.println("Could not initialize tesseract.");
System.exit(1);
}
// Open input image with leptonica library
PIX image = pixRead(args.length > 0 ? args[0] : "C://Users//arpit.tandon//Documents//My Received Files//SomePapers//out.tiff");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
System.out.println("OCR output:\n" + outText.getString());
// Destroy used object and release memory
api.End();
outText.deallocate();
pixDestroy(image);
}
I found the answer: I had to call tessInst.setPageSegMode(3); before calling doOCR. Mode 3 is fully automatic page segmentation without orientation and script detection (OSD). The command-line help for tesseract lists which number selects which page segmentation mode.
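For reference, here is a small sketch of the number-to-mode mapping that the tesseract help prints. The PageSegModes class and its describe method are hypothetical helpers written for illustration, not part of Tess4J. The table covers modes 0 to 10 as listed by Tesseract 3.x. Newer versions may list more, so check the help output of your own install.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tess4J or Tesseract.
// Descriptions follow the Tesseract 3.x command-line help for modes 0-10.
final class PageSegModes {
    private static final Map<Integer, String> MODES = new LinkedHashMap<>();

    static {
        MODES.put(0, "Orientation and script detection (OSD) only");
        MODES.put(1, "Automatic page segmentation with OSD");
        MODES.put(2, "Automatic page segmentation, but no OSD, or OCR");
        MODES.put(3, "Fully automatic page segmentation, but no OSD");
        MODES.put(4, "Assume a single column of text of variable sizes");
        MODES.put(5, "Assume a single uniform block of vertically aligned text");
        MODES.put(6, "Assume a single uniform block of text");
        MODES.put(7, "Treat the image as a single text line");
        MODES.put(8, "Treat the image as a single word");
        MODES.put(9, "Treat the image as a single word in a circle");
        MODES.put(10, "Treat the image as a single character");
    }

    // Returns the description for a mode number, or throws if it is not in the table.
    static String describe(int mode) {
        String description = MODES.get(mode);
        if (description == null) {
            throw new IllegalArgumentException("Unknown page segmentation mode: " + mode);
        }
        return description;
    }

    public static void main(String[] args) {
        int mode = 3; // the value that fixed the column problem above
        System.out.println(mode + " = " + describe(mode));
        // With Tess4J, this same number is what gets passed to the engine:
        // tessInst.setPageSegMode(mode);
    }
}
```

If the columns still come out merged with mode 3, mode 1 (the same segmentation with OSD) may be worth trying, but I have only tested 3.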