spring-boot tomcat tesseract

Tesseract in Spring Boot Application gives nonsense results for Japanese


I'm writing a Spring Boot application that uses Bytedeco's Java wrapper for Tesseract OCR to parse Japanese text. Tesseract works fine when I run it outside of Spring Boot, but when I call it from within the Spring Boot application the output is garbled.

For example, given the following image:

坊ちゃん 夏目漱石

If I run the following function, the result is reasonable:

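import org.bytedeco.javacpp.lept
import org.bytedeco.javacpp.tesseract

fun main() {
    val api = tesseract.TessBaseAPI()
    // Load the vertical Japanese model from the local tessdata directory.
    api.Init("src/main/resources/tessdata", "jpn_vert")
    api.SetPageSegMode(tesseract.PSM_SINGLE_BLOCK_VERT_TEXT)
    val pixImage = lept.pixRead("src/main/resources/image.png")
    api.SetImage(pixImage)
    // GetUTF8Text returns a BytePointer holding UTF-8 encoded bytes.
    val result = api.GetUTF8Text()
    System.out.println("Parsed text:\n" + result?.string)
}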

Prints:

Parsed text:
坊っちゃん
夏目 滞 石

When I run the same code from within a Spring Boot WebSocket handler, however, the output is garbled:

@SpringBootApplication
open class BootApplication

fun main(args: Array<String>) {
    runApplication<BootApplication>(*args)
}

@Configuration
@EnableWebSocket
open class WebSocketConfiguration: WebSocketConfigurer {
    @Bean
    open fun createWebSocketContainer(): ServletServerContainerFactoryBean {
        val container = ServletServerContainerFactoryBean()
        container.maxBinaryMessageBufferSize = 1024000
        return container
    }

    override fun registerWebSocketHandlers(registry: WebSocketHandlerRegistry) {
        registry.addHandler(Endpoint(), "/parse").setAllowedOrigins("*")
    }
}

class Endpoint: AbstractWebSocketHandler() {
    @Throws(IOException::class)
    override fun handleBinaryMessage(session: WebSocketSession?, message: BinaryMessage?) {
        // Same code as above:
        val api = tesseract.TessBaseAPI()
        api.Init("src/main/resources/tessdata", "jpn_vert")
        api.SetPageSegMode(tesseract.PSM_SINGLE_BLOCK_VERT_TEXT)
        val pixImage = lept.pixRead("src/main/resources/image.png")
        api.SetImage(pixImage)
        val result = api.GetUTF8Text()
        System.out.println("Parsed text:\n" + result?.string)
    }
}

Prints the following when handleBinaryMessage is called:

Parsed text:
蝮翫▲縺。繧?繧?
螟冗岼 貊? 遏ウ

I ran a quick test on some English text and that worked fine, so I assume this issue is language-specific.

I'm running the Boot application with the bootRun task from the Spring Boot Gradle plugin, which starts an embedded Apache Tomcat server. My first thought was that this has something to do with the Tesseract wrapper being a JNI library and the environment it runs in (Tomcat) being different. If that's the case, is there some extra configuration needed to get Tesseract to work with Spring Boot and Tomcat?

For reference, my build.gradle is as follows:

plugins {
    id 'org.jetbrains.kotlin.jvm' version '1.3.10'
    id("org.springframework.boot") version "2.1.0.RELEASE"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation "org.jetbrains.kotlin:kotlin-stdlib-jdk8"
    implementation group: "org.bytedeco.javacpp-presets", name: "tesseract", version: "4.0.0-rc2-1.4.3"
    implementation group: "org.bytedeco.javacpp-presets", name: "tesseract", version: "4.0.0-rc2-1.4.3", classifier: "windows-x86_64"
    implementation group: "org.bytedeco.javacpp-presets", name: "leptonica", version: "1.76.0-1.4.3", classifier: "windows-x86_64"
    implementation group: "org.springframework.boot", name: "spring-boot", version: "2.1.0.RELEASE"
    implementation group: "org.springframework.boot", name: "spring-boot-starter-web", version: "2.1.0.RELEASE"
    implementation group: "org.springframework", name: "spring-websocket", version: "5.1.2.RELEASE"
}

compileKotlin {
    kotlinOptions.jvmTarget = "1.8"
}

Solution

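    This turned out to be an encoding issue. Java's file.encoding system property was set to UTF-8 when running outside of Boot, but to windows-31j when running in Boot. Tesseract returns UTF-8 bytes, and result?.string decodes them with the JVM's default charset, so under Boot the UTF-8 bytes were being read as windows-31j. Switching result?.string to result?.getString("UTF-8") fixed it.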
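
To illustrate the mismatch without involving Tesseract at all, here is a minimal sketch in plain Kotlin. It prints the charset the current JVM defaults to, then decodes the UTF-8 bytes of the expected text once with windows-31j and once with UTF-8. The helper name decodeWith is hypothetical and only exists for this example, and the sketch assumes the JRE ships the windows-31j charset (it checks before using it).

```kotlin
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Hypothetical helper: encode `text` as UTF-8 (as Tesseract does), then
// decode those bytes with `charset` to simulate what the reader sees.
fun decodeWith(text: String, charset: Charset): String =
    String(text.toByteArray(StandardCharsets.UTF_8), charset)

fun main() {
    // The charset used whenever bytes are decoded without naming one.
    println("file.encoding   = ${System.getProperty("file.encoding")}")
    println("default charset = ${Charset.defaultCharset()}")

    val original = "坊っちゃん"

    // Assumption: windows-31j is available in this JRE; skip otherwise.
    if (Charset.isSupported("windows-31j")) {
        val wrong = decodeWith(original, Charset.forName("windows-31j"))
        println("decoded as windows-31j: $wrong") // garbled text
    }

    // Naming the charset explicitly recovers the text on any platform.
    println("decoded as UTF-8:       ${decodeWith(original, StandardCharsets.UTF_8)}")
}
```

If the first two lines print different values when launched through bootRun than when launched directly, that would match the behaviour described above. Decoding with an explicit charset, as getString("UTF-8") does, avoids depending on how the JVM was started.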


    I'm going to chalk this up to a bug with Bytedeco's Tesseract wrapper. I did the equivalent test with tess4j and had no problems:

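    import java.io.File
    import net.sourceforge.tess4j.ITessAPI.TessPageSegMode.PSM_SINGLE_BLOCK_VERT_TEXT
    import net.sourceforge.tess4j.Tesseract

    val imageFile = File("src/main/resources/image.png")
    val tess = Tesseract()
    tess.setPageSegMode(PSM_SINGLE_BLOCK_VERT_TEXT)
    tess.setLanguage("jpn_vert")
    System.out.println("Parsed text:\n" + tess.doOCR(imageFile))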
    

    Gives me:

    坊っちゃん
    夏目 滞 石
    

    as expected.

    I've filed a ticket on their GitHub, so hopefully this will be cleared up before long.