node.js opencv ocr tesseract image-preprocessing

How to preprocess black text on a cream background for Tesseract using OpenCV?


I am looking to extract text from this image: screenshot

Specifically, I want the values under "Kills". However, I cannot seem to get accurate results.

I tried converting the image to grayscale and applying a threshold:

import { createWorker, OEM, PSM } from "tesseract.js";
import cv from "@u4/opencv4nodejs";
import fs from "node:fs/promises";

const worker = await createWorker("eng", OEM.TESSERACT_LSTM_COMBINED);

// Restrict recognition to digits and treat the region as one block of text.
await worker.setParameters({
  tessedit_char_whitelist: "0123456789",
  tessedit_pageseg_mode: PSM.SINGLE_BLOCK,
});

// Decode straight to grayscale. imdecode expects an IMREAD_* flag here,
// not a color-conversion code such as COLOR_BGR2GRAY.
const image = await cv.imdecodeAsync(
  await fs.readFile("input.png"),
  cv.IMREAD_GRAYSCALE
);

// Binarize: pixels brighter than 150 become white (255), the rest black.
const threshHoldedImage =
  await image.thresholdAsync(
    150,
    255,
    cv.THRESH_BINARY
  );

// Encode back to PNG so tesseract.js can read the result as a buffer.
const encodedImage = await cv.imencodeAsync(".png", threshHoldedImage);

const {
  data: { text: tierKillsText },
} = await worker.recognize(encodedImage, {
  rectangle: {
    top: 265,
    left: 552,
    width: 87,
    height: 138,
  },
});

console.log(tierKillsText);
// Received: 3228387
// Expected: 3328387

I have also tried applying a Gaussian blur, without success:

const sigma = 0.75;
const blurred = threshHoldedImage.gaussianBlur(new cv.Size(0, 0), sigma);

Solution

  • I fixed it by recognizing each line individually instead of the whole block at once, which seems to give more accurate results.
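
The answer does not include code, so here is a minimal sketch of what reading each line individually could look like. It assumes the rows under "Kills" are evenly spaced inside the rectangle from the question. The helper names splitIntoRows and recognizeRows are hypothetical, and the number of rows is something you have to supply for your own screenshot.

// Hypothetical helper: cut one rectangle into `rows` horizontal strips.
// Edges are rounded so the strips tile the rectangle without gaps.
function splitIntoRows(rectangle, rows) {
  const rowHeight = rectangle.height / rows;
  return Array.from({ length: rows }, (_, i) => {
    const top = Math.round(rectangle.top + i * rowHeight);
    const bottom = Math.round(rectangle.top + (i + 1) * rowHeight);
    return {
      left: rectangle.left,
      top,
      width: rectangle.width,
      height: bottom - top,
    };
  });
}

// Hypothetical helper: run one recognize() call per strip.
// `worker` is a tesseract.js worker, `image` is the encoded PNG buffer.
async function recognizeRows(worker, image, rectangle, rows) {
  const texts = [];
  for (const row of splitIntoRows(rectangle, rows)) {
    const {
      data: { text },
    } = await worker.recognize(image, { rectangle: row });
    texts.push(text.trim());
  }
  return texts;
}

With the worker and image buffer from the question, this would be called as recognizeRows(worker, imageBuffer, { top: 265, left: 552, width: 87, height: 138 }, rowCount), where rowCount is the number of rows in the column. Since each strip then holds a single line, switching tessedit_pageseg_mode to PSM.SINGLE_LINE may also be worth trying, though that is not confirmed by the answer.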