Tags: java, ubuntu-18.04, tess4j

Tess4J on Ubuntu crashing JVM


I am new to Tess4J and to JNA, so apologies if this is obvious, but I have not been able to find an answer in the blogs. I am on Ubuntu 18.04, running Java 17.0.1 and Tomcat 10.0. I have built a simple dynamic web app, details below. I installed the native packages as follows:

sudo apt install tesseract-ocr tesseract-ocr-rus libleptonica-dev

First, I can process my test document from the command line with no problems:

tesseract /tmp/output-0.jpg /tmp/file -l rus+eng

But when I try the same from Java, the JVM crashes.

The relevant Java code in my OCR class is as follows:

    private static final String tessDir = "/usr/share/tesseract-ocr/4.00/";
    private static final String libDir = "/usr/lib/x86_64-linux-gnu";
    private ITesseract ocr = new Tesseract();
    
    public OCR() {
        System.setProperty("java.library.path", System.getProperty("java.library.path") + ":" + libDir);
        ocr.setDatapath(tessDir);
    }

    public String doOcr (String inputDirName, String outputDirName, List<File> files, Set<Lang> langs) throws IOException {
        File f1 = new File("/tmp/output-0.jpg");
        String s = "";
        ocr.setLanguage("rus+eng");
        try {
            s = ocr.doOCR(f1);
        } catch (Exception e) {
            throw new RuntimeException(e.getMessage());
        }
        return s;
    }
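
A note on the constructor above: `java.library.path` is read by the JVM at startup, so appending to it later is generally not picked up, and JNA (which appears in the stack trace further down) looks for native libraries through its own `jna.library.path` system property. The following is a minimal sketch, not a confirmed fix, of appending a directory to that property before Tess4J is first used. The class and method names are hypothetical. Because the crash log shows that libtesseract.so.4 was in fact loaded, the library lookup is evidently working, so this is unlikely to be the cause of the SIGSEGV by itself.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of JNA or Tess4J.
final class JnaPathSetup {

    /**
     * Appends dir to the path-style system property named key,
     * unless it is already listed, and returns the resulting value.
     */
    static String appendToPathProperty(String key, String dir) {
        String current = System.getProperty(key, "");
        List<String> entries = new ArrayList<>();
        if (!current.isEmpty()) {
            entries.addAll(Arrays.asList(current.split(Pattern.quote(File.pathSeparator))));
        }
        if (!entries.contains(dir)) {
            entries.add(dir);
        }
        String updated = String.join(File.pathSeparator, entries);
        System.setProperty(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        // Must run before the first Tess4J/JNA call that loads the native library.
        System.out.println(appendToPathProperty("jna.library.path", "/usr/lib/x86_64-linux-gnu"));
    }
}
```

In a Tomcat deployment the same effect can usually be had without code by passing `-Djna.library.path=...` in the JVM options.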

pom.xml:

    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna-platform</artifactId>
        <version>5.6.0</version>
    </dependency>
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.6.0</version>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.lept4j</groupId>
        <artifactId>lept4j</artifactId>
        <version>1.16.1</version>
    </dependency>
    <dependency>
        <groupId>org.ghost4j</groupId>
        <artifactId>ghost4j</artifactId>
        <version>1.0.1</version>
    </dependency>

The crash log looks like this:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f67aeed2c27, pid=23274, tid=23912
#
# JRE version: Java(TM) SE Runtime Environment (17.0.1+12) (build 17.0.1+12-LTS-39)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0.1+12-LTS-39, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
...
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libtesseract.so.4+0xa1c27]  tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int)+0x437

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.sun.jna.Native.invokePointer(Lcom/sun/jna/Function;JI[Ljava/lang/Object;)J+0
j  com.sun.jna.Function.invokePointer(I[Ljava/lang/Object;)Lcom/sun/jna/Pointer;+7
j  com.sun.jna.Function.invoke([Ljava/lang/Object;Ljava/lang/Class;ZI)Ljava/lang/Object;+385
j  com.sun.jna.Function.invoke(Ljava/lang/reflect/Method;[Ljava/lang/Class;Ljava/lang/Class;[Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+271
j  com.sun.jna.Library$Handler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+390
j  jdk.proxy3.$Proxy10.TessBaseAPIGetUTF8Text(Lnet/sourceforge/tess4j/ITessAPI$TessBaseAPI;)Lcom/sun/jna/Pointer;+16 jdk.proxy3
j  net.sourceforge.tess4j.Tesseract.getOCRText(Ljava/lang/String;I)Ljava/lang/String;+269
j  net.sourceforge.tess4j.Tesseract.doOCR(Ljavax/imageio/IIOImage;Ljava/lang/String;Ljava/awt/Rectangle;I)Ljava/lang/String;+18
j  net.sourceforge.tess4j.Tesseract.doOCR(Ljava/io/File;Ljava/awt/Rectangle;)Ljava/lang/String;+126
j  net.sourceforge.tess4j.Tesseract.doOCR(Ljava/io/File;)Ljava/lang/String;+3
j  mypackage.OCR.doOcr(Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Ljava/util/Set;)Ljava/lang/String;+32
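
The most informative line in a log like this is the first native frame: it names the shared library and the offset inside it where the fault occurred, which here places the crash inside libtesseract itself rather than in JNA or the Java code. As a rough sketch, a frame line can be split into its parts as below. The `NativeFrame` class is hypothetical, and the pattern only covers the `C  [lib+0xoffset]  symbol` shape shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the JDK, JNA, or Tess4J.
final class NativeFrame {
    // Matches lines such as "C  [libtesseract.so.4+0xa1c27]  symbol(...)+0x437"
    private static final Pattern FRAME =
            Pattern.compile("^C\\s+\\[(\\S+?)\\+0x([0-9a-fA-F]+)\\]\\s*(.*)$");

    final String library; // e.g. libtesseract.so.4
    final long offset;    // offset of the faulting instruction within the library
    final String symbol;  // remainder of the line: symbol name plus its own offset

    private NativeFrame(String library, long offset, String symbol) {
        this.library = library;
        this.offset = offset;
        this.symbol = symbol;
    }

    /** Returns the parsed frame, or empty if the line is not a native (C) frame. */
    static Optional<NativeFrame> parse(String line) {
        Matcher m = FRAME.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new NativeFrame(m.group(1), Long.parseLong(m.group(2), 16), m.group(3)));
    }

    public static void main(String[] args) {
        String line = "C  [libtesseract.so.4+0xa1c27]  tesseract::Tesseract::recog_all_words("
                + "PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int)+0x437";
        parse(line).ifPresent(f ->
                System.out.printf("%s at offset 0x%x in %s%n", f.symbol, f.offset, f.library));
    }
}
```

If debug symbols for libtesseract happen to be installed, the offset can be handed to a tool such as `addr2line -e /usr/lib/x86_64-linux-gnu/libtesseract.so.4 0xa1c27` to get a source location.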

libDir does indeed contain libtesseract.so.4 -> libtesseract.so.4.0.0 and liblept.so -> liblept.so.5.0.2.
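
To check for a version mismatch, it helps to see exactly which files the library symlinks resolve to and compare those versions with whatever the Tess4J and Lept4J releases in the pom say they were built against. The sketch below does the listing from Java, under the assumption that the libraries sit in a single directory. `NativeLibProbe` is a hypothetical name, and a dangling symlink would make `toRealPath()` throw.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of JNA or Tess4J.
final class NativeLibProbe {

    /**
     * Maps each file in dir whose name starts with one of the prefixes
     * to the file name it resolves to once symlinks are followed.
     */
    static Map<String, String> resolve(Path dir, String... prefixes) throws IOException {
        Map<String, String> resolved = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                for (String prefix : prefixes) {
                    if (name.startsWith(prefix)) {
                        // toRealPath() follows symbolic links to the actual file.
                        resolved.put(name, entry.toRealPath().getFileName().toString());
                        break;
                    }
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/usr/lib/x86_64-linux-gnu");
        resolve(dir, "libtesseract.so", "liblept.so")
                .forEach((link, target) -> System.out.println(link + " -> " + target));
    }
}
```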

So what am I missing? Version mismatch somewhere?


Solution

  • You may not be aware of it, but the JavaCPP Presets for Tesseract (org.bytedeco) provide an API that you can use instead of pointing directly at the lib folder of your installation.

    The tesseract-platform artifact bundles the native binaries, so the same code is platform-agnostic and works on both Windows and Linux.

    Example usage:

    The pom.xml build file

    <project>
        <modelVersion>4.0.0</modelVersion>
        <groupId>org.bytedeco.tesseract</groupId>
        <artifactId>BasicExample</artifactId>
        <version>1.5.7-SNAPSHOT</version>
        <properties>
            <exec.mainClass>BasicExample</exec.mainClass>
        </properties>
        <dependencies>
            <dependency>
                <groupId>org.bytedeco</groupId>
                <artifactId>tesseract-platform</artifactId>
                <version>5.0.0-1.5.7-SNAPSHOT</version>
            </dependency>
        </dependencies>
        <build>
            <sourceDirectory>.</sourceDirectory>
        </build>
    </project>
    

    The BasicExample.java source file

    import org.bytedeco.javacpp.*;
    import org.bytedeco.leptonica.*;
    import org.bytedeco.tesseract.*;
    import static org.bytedeco.leptonica.global.lept.*;
    import static org.bytedeco.tesseract.global.tesseract.*;
    
    public class BasicExample {
        public static void main(String[] args) {
            BytePointer outText;
    
            TessBaseAPI api = new TessBaseAPI();
            // Initialize tesseract-ocr with English, without specifying tessdata path
            if (api.Init(null, "eng") != 0) {
                System.err.println("Could not initialize tesseract.");
                System.exit(1);
            }
    
            // Open input image with leptonica library
            PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
            api.SetImage(image);
            // Get OCR result
            outText = api.GetUTF8Text();
            System.out.println("OCR output:\n" + outText.getString());
    
            // Destroy used object and release memory
            api.End();
            outText.deallocate();
            pixDestroy(image);
        }
    }
    

    Project Documentation:

    https://github.com/bytedeco/javacpp-presets/tree/master/tesseract

    Related Stack Overflow question for V4: Using Tesseract from java