python image-processing tesseract python-tesseract tess4j

Tesseract is giving highly inconsistent results


I want to get the result of a match, which is available only as an image. Below is the code I'm using to read text from the image. I have also tried Python code, and it gives the same result. How can I improve the output, or is there a better approach to my problem?

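    import java.io.File;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    public String getImgText(String imageLocation) {
        ITesseract instance = new Tesseract();
        instance.setDatapath("/tessdata");
        instance.setLanguage("eng");

        try {
            return instance.doOCR(new File(imageLocation));
        } catch (TesseractException e) {
            // Print the message; calling e.getMessage() alone discards it
            System.err.println(e.getMessage());
            return "Error while reading image";
        }
    }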

The output is totally different from the input:

unnl lE

mam-m m,

mun-m, 1 ms "mm M

W urn-mm my A mm“ m

mus-1mm 1 m- m m

mfinlln um: ”mu“ m

ilk-M m.

mwnm mu 5 mm nu-

..mn. n w. tvhrzmr- m

2 rm.“- 0 w, mama: m.

mum-mp 5 mu mum n.

a bulb-h» m

tum-3mm nun mm,” M

3 mmn m; mum“ M

Ema W 7 a“. m

mzsm 5m mm»... m

The input image is:

enter image description here


Solution

  • You should preprocess the image before running Tesseract (Python code using the OpenCV library):

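    import cv2

    # Load the screenshot; "input.png" is a placeholder path
    image = cv2.imread("input.png")

    # Convert to grayscale, then invert the intensities
    img = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    result = cv2.bitwise_not(img)
    # Push near-white pixels (>= 190) to pure white
    result[result >= 190] = 255

    # To show the image
    cv2.imshow("Threshold", result)
    cv2.waitKey()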
    

    Resulting in something like this: enter image description here

    Additionally, it seems the English traineddata handles the PUBG font poorly, so you might want to look into fine-tuning it: Training eng.traineddata for PUBG font
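
Since the question uses Tess4J, the same preprocessing could also be done in plain Java without OpenCV. The following is only a rough sketch: the class name `OcrPreprocess` and the method `preprocess` are hypothetical, the grayscale weights are the standard luma weights (the ones OpenCV uses for `COLOR_BGR2GRAY`), and the threshold of 190 is taken from the Python snippet above. It may need tuning for other screenshots.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: a plain-Java port of the OpenCV steps shown above.
final class OcrPreprocess {

    // Grayscale, invert, then push near-white pixels (>= 190) to pure white.
    static BufferedImage preprocess(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;

                // Standard luma weights for RGB -> gray
                int gray = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // Equivalent of cv2.bitwise_not on an 8-bit image
                int inverted = 255 - gray;
                // Equivalent of result[result >= 190] = 255
                if (inverted >= 190) {
                    inverted = 255;
                }
                // Write the raw 8-bit value into the single gray band
                out.getRaster().setSample(x, y, 0, inverted);
            }
        }
        return out;
    }
}
```

The returned image can then be passed to the `doOCR(BufferedImage)` overload of `ITesseract` instead of the `File` overload used in the question, for example `instance.doOCR(OcrPreprocess.preprocess(ImageIO.read(new File(imageLocation))))`.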