tesseract package in R doesn't recognize any character


I am using R version 3.3.2 and trying to parse some text with the new tesseract package. The image looks like this:

Image

The code is simple:

library(tesseract)
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"))
text <- ocr("some_image_path.png", engine = engine)

The result is:

Too few characters. Skipping this page

Why doesn't it recognize any characters?


Solution

  • Because there are too few characters. The Tesseract source seems to define a minimum of

    const int kMinCharactersToTry = 50;
    

    and the character count is tested against half of that value. When the test fails, you get your error:

    // If there are too few characters, skip this page entirely.
    if (real_max < kMinCharactersToTry / 2) {
      tprintf("Too few characters. Skipping this page\n");
      return 0;
    }
    

    Since 50 / 2 = 25, try again with a sample that has at least 25 characters.
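To make the threshold concrete, here is a minimal standalone sketch of the check quoted above. It is not Tesseract's actual code path: `WouldSkipPage` is a hypothetical helper written for illustration, and `real_max` simply stands in for whatever character count the quoted check uses. Only the constant and the comparison are taken from the snippet.

```cpp
// Illustrative sketch only; mirrors the comparison quoted above.
constexpr int kMinCharactersToTry = 50;  // value from the quoted snippet

// Hypothetical helper: returns true when the page would be skipped.
// real_max stands in for the character count used by the quoted check.
constexpr bool WouldSkipPage(int real_max) {
  return real_max < kMinCharactersToTry / 2;  // integer division: 50 / 2 == 25
}

// Compile-time checks of the boundary.
static_assert(WouldSkipPage(24), "24 characters: page is skipped");
static_assert(!WouldSkipPage(25), "25 characters: page is not skipped");
```

The comparison is strict, so under this reading a count of exactly 25 passes and anything below 25 produces the "Too few characters" message.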