Tess4j tesseract - How can you differentiate between columns or rows in a table?


I am working a bit with tess4j tesseract in Java. It works well and it allows me to do what I need.

But I have come across an issue that I cannot solve without guidance or help.

Let us say, I have the following image:

[Image: a table with three columns, a header row, and two data rows]

This then provides me with the following output:

Column 1 Column 2 Column3

Row 1 Column 1 Rowt Column 3

Row 2 Column 1 Row 2 Column 2 Row 2 Column 3

Here is my code

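    import java.io.File;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    String readFile(String inputFilePath) {
        Tesseract tesseract = new Tesseract();
        // path is defined elsewhere; presumably the directory holding the tessdata files
        tesseract.setDatapath(path);
        tesseract.setLanguage("eng");
        tesseract.setTessVariable("user_defined_dpi", "300");

        String string = null;
        try {
            string = tesseract.doOCR(new File(inputFilePath));
        } catch (TesseractException e) {
            e.printStackTrace();
        }
        return string; // null if OCR failed
    }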

Is there a way to get a result that mimics the layout in the image, so that I can differentiate between the columns?


Solution

  • You can tell Tesseract to preserve the spaces between words and then count them to find the gaps between columns:

    tesseract.setTessVariable("preserve_interword_spaces", "1");
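
Once the spaces are preserved, one way to use them is to split each line wherever a run of several spaces appears. The following is only a rough sketch, not something taken from tess4j itself: the `ColumnSplitter` class, its method names, and the sample text are made up for illustration. It assumes that a gap of two or more spaces marks a column boundary; that threshold will likely need tuning for your image.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnSplitter {

    // Assumption: two or more consecutive spaces separate columns,
    // while a single space separates words inside one cell.
    private static final String COLUMN_GAP = " {2,}";

    /** Splits one OCR line into cells at every run of two or more spaces. */
    static List<String> splitColumns(String line) {
        return Arrays.asList(line.trim().split(COLUMN_GAP));
    }

    /** Splits the whole OCR result into rows of cells, skipping blank lines. */
    static List<List<String>> toTable(String ocrText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                rows.add(splitColumns(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical OCR output, as it might look with preserved spaces.
        String ocrText = "Column 1     Column 2     Column 3\n"
                + "\n"
                + "Row 2 Column 1     Row 2 Column 2     Row 2 Column 3\n";
        for (List<String> row : toTable(ocrText)) {
            System.out.println(row);
        }
    }
}
```

In your case you would pass the string returned by `readFile` to `toTable`. Note that an empty cell only shows up as a wider gap, so a plain split cannot tell which column is missing; for that you would have to compare the actual gap widths.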