Tags: java, tesseract, tess4j

Tesseract: Error when reading non-English text from an image


I am extracting non-English text from an image (a bill) using Tesseract.

However, I get an error when I call doOCR(BufferedImage) (there is no error when the language is set to English):

contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokePointer(Native Method)
    at com.sun.jna.Function.invokePointer(Function.java:470)
    at com.sun.jna.Function.invoke(Function.java:404)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
    at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:260)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:241)

My code:

ITesseract iT = new Tesseract();
iT.setLanguage(LANGUAGE);
iT.setDatapath(System.getenv("TESSDATA_PREFIX"));
try {
  return iT.doOCR(bufferedImage);
} catch (Exception e) {
  // Log the message instead of discarding it
  System.err.println(e.getMessage());
  return "Error while reading image";
}

Text is extracted successfully from some bills, but for certain images I get the error above.


Solution

  • I solved the issue by upgrading the tess4j version. In pom.xml, I changed the dependency from:

    <dependency>
      <groupId>net.sourceforge.tess4j</groupId>
      <artifactId>tess4j</artifactId>
      <version>4.0.0</version>
    </dependency>
    

    to

    <dependency>
      <groupId>net.sourceforge.tess4j</groupId>
      <artifactId>tess4j</artifactId>
      <version>5.3.0</version>
    </dependency>
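
A side note on the try/catch in the question: the stack trace reports java.lang.Error, and Error is not a subclass of Exception, so `catch (Exception e)` never runs for this failure and the fallback string is never returned. The sketch below illustrates that with plain Java and no Tess4J on the classpath. The names OcrGuard, OcrCall and safeOcr are hypothetical, and the lambda only simulates the native failure. Catching the Error keeps the calling thread alive but does not fix the underlying problem, which the version upgrade above addresses.

```java
public class OcrGuard {

    // Hypothetical stand-in for a call such as () -> iT.doOCR(bufferedImage).
    @FunctionalInterface
    public interface OcrCall {
        String run() throws Exception;
    }

    // Runs the OCR call and returns a fallback string on any failure.
    public static String safeOcr(OcrCall call) {
        try {
            return call.run();
        } catch (Exception e) {
            // Ordinary failures (checked or runtime exceptions) land here.
            System.err.println("OCR failed: " + e.getMessage());
            return "Error while reading image";
        } catch (Error e) {
            // java.lang.Error (as in the stack trace) is NOT an Exception,
            // so it needs its own catch clause.
            System.err.println("Native OCR failure: " + e.getMessage());
            return "Error while reading image";
        }
    }

    public static void main(String[] args) {
        // Simulate the "Invalid memory access" error without loading Tess4J.
        String result = safeOcr(() -> { throw new Error("Invalid memory access"); });
        System.out.println(result);
    }
}
```

In real code the lambda would wrap the doOCR call. After a native memory error the library state may not be reliable, so it is probably safer to treat the catch as a way to log and skip the image rather than to keep reusing the same Tesseract instance.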