How to improve Tesseract's recognition accuracy?


I want Tesseract to produce the expected result shown below, but the actual output is different. Can you give me any suggestions for improving it?

  • Input image


  • Expected result
流 動 資 産
固 定 資 産
  • Actual result
産 産
資 資
動 定
  • To reproduce the result
$ git clone https://github.com/zono/ocr.git
$ cd ocr
$ git checkout 0f2541eac302dd1fe2efbbd3b36e7ba40a99d232
$ docker-compose up -d
$ docker exec -it ocr /bin/bash
# /usr/local/bin/tesseract /ocr/src/bssample7.png stdout -l jpn
産 産
資 資
動 定
  • Versions
$ docker -v
Docker version 19.03.5, build 633a0ea

# tesseract -v
tesseract 4.1.1-rc2-22-g08899
 leptonica-1.79.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Solution

  • You need to use a different page segmentation mode to get the expected result.

    Mode 6 tells Tesseract to assume a single uniform block of text. Try appending --psm 6 to your command so that it looks like this:

    $ tesseract /ocr/src/bssample7.png outputfilename -l jpn --psm 6
    

    The available page segmentation modes are described here:

    https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
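
    If it is not obvious which mode suits an image, it can help to run the same file through several modes and compare the output. The following is a minimal sketch, assuming tesseract and the jpn language data are installed as in the container above. The function name try_psm_modes is made up for illustration, and the mode list (3, 4, 6, 11) is only a starting point, not a recommendation:

```shell
# Hypothetical helper: run one image through several page segmentation modes.
# Usage: try_psm_modes IMAGE MODE [MODE...]
try_psm_modes() {
    psm_image="$1"
    shift
    for psm_mode in "$@"; do
        echo "=== --psm $psm_mode ==="
        # Same invocation as in the question, with only --psm varied.
        tesseract "$psm_image" stdout -l jpn --psm "$psm_mode" \
            || echo "(tesseract failed for --psm $psm_mode)"
    done
}

# Image path taken from the question; adjust it for your own files.
try_psm_modes /ocr/src/bssample7.png 3 4 6 11
```

    Mode 3 is the default fully automatic segmentation, 4 assumes a single column of text of variable sizes, 6 assumes a single uniform block of text, and 11 looks for sparse text. Running tesseract --help-psm prints the full list.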
