Tags: java, maven, tesseract, executable-jar, tess4j

Tesseract for Java: setting TESSDATA_PREFIX for an executable jar


The ultimate goal of this project is to drop the jar into a directory, have it run Tesseract there, and produce a results directory containing the output txt file. I am having some issues with Tesseract, though. I am working with tess4j in Java with Maven and I want to package my code as an executable jar. The project works fine as a desktop app, but whenever I run it with java -jar fileName.jar (after exporting to a jar) it gives me the error

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory
Failed loading language 'eng'
...

I looked online and couldn't really find out how to set up Tesseract for a jar and get the paths right. I use Maven and have the tess4j dependency in my pom file (tess4j, version 3.0), and I have the tessdata in my project.

I am fairly new to Maven and jar files and have never used Tesseract before, but as far as I can tell from what I found online, I set it up correctly.

Does anyone know how to make tess4j point to the tessdata directory in my project and have a dynamic path, so I can use it on multiple computers and in different locations?

This is how I call Tesseract

    Tesseract instance = new Tesseract();
    instance.setDatapath("src/main/resources");
    String result = instance.doOCR(imageFile);
    String fileName = imageFile.getName().replace(".jpg", "");
    System.out.println("Parsed Image " + fileName);
    return result;

EDIT

This is how I tried to set the environment variable TESSDATA_PREFIX in my code

    String dir = System.getProperty("user.dir");
    System.out.println("current dir = " + dir);
    ProcessBuilder pb = new ProcessBuilder("CMD", "/C", "SET");
    Map<String, String> env = pb.environment();
    env.put("TESSDATA_PREFIX", dir + "\\tessdata");
    Process p = pb.start();

but this had no discernible effect. I still got the same error.
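
A likely reason this has no effect: the map returned by ProcessBuilder.environment() is only applied to child processes started from that builder, so the JVM that later runs Tesseract never sees the change. The sketch below illustrates this with a hypothetical variable name (DEMO_TESSDATA_PREFIX) and class name (ChildEnvDemo), neither of which comes from the original project. The child process is never started, so the command given to the builder does not matter.

```java
public class ChildEnvDemo {

    // Puts a variable into a ProcessBuilder's environment map and reports
    // whether the current JVM can see it through System.getenv.
    static boolean parentSeesVariable(String name, String value) {
        // The command is a placeholder; the process is never started.
        ProcessBuilder pb = new ProcessBuilder("java", "-version");
        pb.environment().put(name, value);
        // System.getenv reads the environment of the running JVM,
        // which the builder's map does not modify.
        return value.equals(System.getenv(name));
    }

    public static void main(String[] args) {
        // DEMO_TESSDATA_PREFIX is a made-up name used only for this demonstration.
        boolean visible = parentSeesVariable("DEMO_TESSDATA_PREFIX", "demo-value");
        System.out.println("visible in current JVM: " + visible);
    }
}
```

Unless the variable already happens to be set to the same value outside the program, the check reports that it is not visible in the current JVM.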

EDIT 2

According to the error message, TESSDATA_PREFIX needs to point to the parent directory of tessdata. I tried that as well, to no avail.

EDIT 3

After a ton of searching and trying to fix it, I am not sure it is even possible. The doOCR method in tess4j takes a BufferedImage or a File, which would be all right if my images weren't dynamic; since they are, I can't really store them in the jar. Not to mention that TESSDATA_PREFIX still won't set. If anyone has any ideas I am still all ears, and I will keep looking for a solution, but I'm not sure it will work at all.


Solution

  • It randomly started working when I

    1. put the tessdata folder in the same directory as my jar

    2. changed the setDatapath call to the following

      Tesseract instance = new Tesseract();
      instance.setDatapath(".");
      String result = instance.doOCR(imageFile);
      String fileName = imageFile.getName().replace(".jpg", "");
      System.out.println("Parsed Image " + fileName);
      return result;
      

    3. exported from Eclipse by right-clicking the project, choosing Export, selecting Java -> Runnable JAR file, then setting the library handling option to "Extract required libraries into generated JAR".

    (Side note: the environment variable code I tried earlier no longer needs to be in the project.)

    I really thought I had tried this, but I guess something must have been wrong. I removed tessdata from my project and will have to include it wherever the jar is run. I'm not really sure why it started working, but I'm glad it did.
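
    One caveat with setDatapath("."): a relative path resolves against the working directory the jar is launched from, not against the jar's own location, so it only works when java -jar is run from the folder that holds both the jar and tessdata. A minimal sketch of one way to make this independent of the working directory is shown below. The class name TessdataLocator and its methods are hypothetical helpers, not part of tess4j, and the sketch assumes (as with tess4j 3.0 above) that the datapath should be the parent directory of tessdata; other Tesseract versions may expect the tessdata directory itself.

```java
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of tess4j.
public class TessdataLocator {

    // Directory containing the running jar, or the classes directory
    // itself when the code is run unpackaged (for example from an IDE).
    static Path jarDirectory() throws URISyntaxException {
        Path codeSource = Paths.get(
                TessdataLocator.class.getProtectionDomain()
                        .getCodeSource().getLocation().toURI());
        return Files.isDirectory(codeSource) ? codeSource : codeSource.getParent();
    }

    // Returns the absolute form of baseDir if it contains a "tessdata"
    // folder, and fails with a clear message otherwise.
    static Path resolveDataPath(Path baseDir) throws IOException {
        Path base = baseDir.toAbsolutePath().normalize();
        if (!Files.isDirectory(base.resolve("tessdata"))) {
            throw new IOException("No tessdata directory found in " + base);
        }
        return base;
    }

    public static void main(String[] args) throws URISyntaxException {
        try {
            Path dataPath = resolveDataPath(jarDirectory());
            System.out.println("Tesseract datapath = " + dataPath);
            // With tess4j on the classpath, this value would replace ".":
            //   instance.setDatapath(dataPath.toString());
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

    The tessdata folder still has to be shipped next to the jar, exactly as in step 1. The only change is that the jar can then be started from any directory.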