r, tesseract, image-recognition

tesseract in R - read white font on black background


I am fairly new to tesseract. Some people on this forum have had similar problems, but none of the answers gave me a satisfying solution, hence this question.

I have pictures from a street camera and I want to get the time stamps of the footage. After cutting out the time stamps they look like this:

Picture

I approach this problem by using tesseract with R:

library(tesseract)
library(magick)
eng <- tesseract("eng")

input <- image_read("image from above")

Using basic tesseract I get:

input %>% tesseract::ocr(engine = eng)
# [1] "SRE SAA PRO 206197180731 17:33:88\n"

Obviously, this doesn't help much. Therefore, after reading up on the issue I tried this:

text <- input %>%
  image_resize("2000x") %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() 

cat(text)

# es bt i deen | ee) eee i ae 2s ee ee ee eee ec ee |

This result is even worse, which is really frustrating. How do I get a correct result? Any help is warmly welcome :)

EDIT

@Max Teflon answered the question for this example. However, I realised that some images are still read incorrectly, such as these:

enter image description here

enter image description here

Can anyone further improve his solution?


Solution

  • What a nice problem! It was really fun to play around with. I found this solution to work for your example:

    
    library(tesseract)
    library(magick)
    
    eng <- tesseract("eng")
    
    input <- image_read("https://i.sstatic.net/0QzhP.jpg") %>% 
      .[[1]] %>% 
      as.numeric() # numeric pixel values (0 to 1) are easier to work with
    image_read(ifelse(input < .9, 1, 0)) # binarize and invert: pixels below 0.9 become white, near-white pixels become black
    

    So far so good, here is the black-and-white version:

    Just running OCR on this one did not quite work, so I tried changing its size:

    
    image_read(ifelse(input < .9, 1, 0)) %>% 
      image_resize('500x') %>% # shrink to 500 px wide; this worked around the OCR errors
      tesseract::ocr()
    #> [1] "TLC200 PRO 2019/10/31 17:33:00\n"
    

    The resize width and the contrast threshold (0.9) are just the result of playing around. You might need to change them if the solution doesn't work as well on the rest of your pictures.

    Created on 2020-01-15 by the reprex package (v0.3.0)
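
Since the width and threshold above were found by trial and error, one way to tune them for the remaining pictures is to try a few widths and accept the first OCR result that actually looks like a time stamp. The following is only a rough sketch, not part of the original answer: the helper names `is_plausible_timestamp()`, `ocr_at_width()` and `read_timestamp()` are made up for illustration, the candidate widths are guesses, and the binarization uses magick's own `image_threshold()` and `image_negate()` as a stand-in that only roughly mirrors the `ifelse()` step. It also assumes every time stamp has the form `2019/10/31 17:33:00`.

```r
# Packages are called with pkg:: so nothing needs to be attached.

# Hypothetical helper: TRUE if `txt` contains something shaped like
# "2019/10/31 17:33:00" with plausible month, day, hour, minute and second.
is_plausible_timestamp <- function(txt) {
  pattern <- paste0(
    "[0-9]{4}/(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01]) ",
    "([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]"
  )
  grepl(pattern, txt)
}

# Hypothetical helper: binarize and invert the image, resize it to `width`
# pixels wide, then run OCR with the given tesseract engine.
ocr_at_width <- function(img, width, engine, threshold = "90%") {
  img <- magick::image_convert(img, type = "Grayscale")
  # pixels below the threshold become black, pixels above it become white
  img <- magick::image_threshold(img, type = "black", threshold = threshold)
  img <- magick::image_threshold(img, type = "white", threshold = threshold)
  # invert, so the white font ends up as dark text on a light background
  img <- magick::image_negate(img)
  img <- magick::image_resize(img, paste0(width, "x"))
  tesseract::ocr(img, engine = engine)
}

# Hypothetical helper: try several widths (the values here are assumptions)
# and return the first plausible reading, or NA if none is found.
read_timestamp <- function(path, widths = c(500, 750, 1000),
                           engine = tesseract::tesseract("eng")) {
  img <- magick::image_read(path)
  for (w in widths) {
    txt <- ocr_at_width(img, w, engine)
    if (is_plausible_timestamp(txt)) return(txt)
  }
  NA_character_
}
```

Calling `read_timestamp("https://i.sstatic.net/0QzhP.jpg")` would run this on the example picture. The plausibility check would also have rejected the first attempt from the question, since `17:33:88` has an impossible seconds value, so any picture that comes back as `NA` is one worth inspecting by hand.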