How can you skip images that fail in Tesseract?


I have a folder with a bit more than 50k images. Here is the code I have written.

public static File folder = new File("D:\\image\\");
public static File[] listofFiles = folder.listFiles();
private static int counter;

public static void main(String[] args) {

    Tesseract tesseract = new Tesseract();
    try {
        tesseract.setDatapath("C:\\Users\\zirpm\\Documents\\Coden\\Libaries\\Tess4J\\tessdata");
        for (int i = 0; i < listofFiles.length; i++) {
            String text = tesseract.doOCR(new File("D:\\image\\"+listofFiles[i].getName()));
            counter++;
            System.out.println("Image Number: "+counter+"  "+text);
        }


    }catch (TesseractException e) {
        e.printStackTrace();
        System.out.println("TESSERACT ERROR");
    }

}

On some images it runs into the following error:

Cannot convert RAW image to Pix with bpp = 64
Please call SetImage before attempting recognition.net.sourceforge.tess4j.TesseractException: java.lang.NullPointerException
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
at com.krissemicolon.Main.main(Main.java:23)
Caused by: java.lang.NullPointerException
at net.sourceforge.tess4j.Tesseract.getOCRText(Unknown Source)
at net.sourceforge.tess4j.Tesseract.doOCR(Unknown Source)
... 3 more

How can I skip the images that cause this error and move on to the next one?


Solution

  • Just add another try-catch inside the loop:

    public static File folder = new File("D:\\image\\");
    public static File[] listofFiles = folder.listFiles();
    private static int counter;
    
    public static void main(String[] args) {
    
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("C:\\Users\\zirpm\\Documents\\Coden\\Libaries\\Tess4J\\tessdata");
        for (int i = 0; i < listofFiles.length; i++) {
            try {
                String text = tesseract.doOCR(new File("D:\\image\\"+listofFiles[i].getName()));
                // Only reached when doOCR succeeded, so text is always set here
                counter++;
                System.out.println("Image Number: "+counter+"  "+text);
            } catch (TesseractException e) {
                // doOCR wraps the NullPointerException in a TesseractException
                System.out.println("Skipping "+listofFiles[i].getName());
            }
        }
    
    }

    If doOCR throws a TesseractException for an image, the loop prints the file name and moves on to the next image. The counter and the output line sit inside the inner try, so they only run for images that were actually read.

    The outer try-catch block is dropped here. Once the inner one handles doOCR, nothing else in main throws TesseractException, and the compiler rejects a catch clause for a checked exception that can never be thrown.
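
    If you also want a record of which files were skipped, the same idea can be pulled into a small helper. This is only a sketch: SkipOnFailure, OcrStep and processAll are made-up names, not part of Tess4J, and the OcrStep interface stands in for tesseract.doOCR(File) so the loop can be tried without the library.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class SkipOnFailure {

    /** Hypothetical stand-in for tesseract.doOCR(File); any exception means "skip this file". */
    @FunctionalInterface
    interface OcrStep {
        String recognize(File image) throws Exception;
    }

    /** Runs the step on every file and returns the files that had to be skipped. */
    static List<File> processAll(File[] files, OcrStep step) {
        List<File> skipped = new ArrayList<>();
        if (files == null) {
            // File.listFiles() returns null when the folder is missing or unreadable
            return skipped;
        }
        int counter = 0;
        for (File file : files) {
            try {
                String text = step.recognize(file);
                counter++; // counts only the images that were read successfully
                System.out.println("Image Number: " + counter + "  " + text);
            } catch (Exception e) {
                // Broad catch because OcrStep declares Exception; remember the file and continue
                skipped.add(file);
                System.out.println("Skipping " + file.getName() + ": " + e.getMessage());
            }
        }
        return skipped;
    }
}
```

    With Tess4J on the classpath, a call such as SkipOnFailure.processAll(folder.listFiles(), tesseract::doOCR) should fit, since doOCR(File) has the same shape as OcrStep. Treat that call as an assumption to check against your Tess4J version. The returned list lets you inspect or re-process the problem images afterwards.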