I am trying to preprocess this image.
I then try to read the numbers on the right with Tesseract, like this:
const COORDINATES = {
  MORE_INFO_LABELS: {
    x: 740,
    y: 165,
    w: 112,
    h: 326,
  },
};
const worker = await createWorker("eng", OEM.TESSERACT_LSTM_COMBINED);
await worker.setParameters({
  tessedit_char_whitelist: "0123456789",
  tessedit_pageseg_mode: PSM.SINGLE_LINE,
});

const moreInfoScreenshot = await cv.imdecodeAsync(
  await fs.readFile("test.png"),
  cv.IMREAD_GRAYSCALE
);
const binaryImage = moreInfoScreenshot.adaptiveThreshold(
  255,
  cv.ADAPTIVE_THRESH_GAUSSIAN_C,
  cv.THRESH_BINARY_INV,
  11,
  2
);
const moreInfoScreenshotPNG = await cv.imencodeAsync(".png", binaryImage);
// Also write the preprocessed image to disk so it can be inspected.
await cv.imwriteAsync("test-fmt.png", binaryImage);
// Convert the { x, y, w, h } shape into the rectangle shape that recognize() expects.
function coordinatesToRectangle(coordinates: Required<Coordinate>) {
  return {
    top: coordinates.y,
    left: coordinates.x,
    width: coordinates.w,
    height: coordinates.h,
  };
}

const {
  data: { text: moreInfoText },
} = await worker.recognize(moreInfoScreenshotPNG, {
  rectangle: coordinatesToRectangle(COORDINATES.MORE_INFO_LABELS),
});
The output image looks like this. The problem is that Tesseract does not read the smaller numbers (moreInfoText: '100408218\n18870369\n26783840937\n3330133360\n215735\n'). How can I make sure those are read properly?
You're lucky in that your image's background is mostly blue. When reading the image in color (warning: OpenCV defaults to BGR, not RGB), you can extract each color channel with cv.split().
As you can see, since the background is blue, the red channel is already wonderfully clean. Now you can either try running Tesseract on that, or additionally threshold and invert the image. I'm using OpenCV-Python to demonstrate, but the OpenCV functions are the same:
import cv2 as cv

img = cv.imread("image.png")  # loaded as BGR
_, _, r = cv.split(img)  # channels come back in B, G, R order
_, thresh = cv.threshold(r, 85, 255, cv.THRESH_BINARY)
Now that the image is thresholded, you can just use cv.bitwise_not() to invert the colors:
cv.bitwise_not(thresh, thresh)
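To make explicit what the split, threshold, and invert steps do to each pixel, here is a dependency-free TypeScript sketch. It assumes a raw, interleaved 8-bit BGR pixel buffer; the function name `redChannelInvertedMask` is hypothetical and not part of OpenCV. In a real pipeline you would use the library's own channel-split, threshold, and bitwise-not calls instead.

```typescript
// Hypothetical helper: mimics cv.split -> cv.threshold(THRESH_BINARY) -> cv.bitwise_not
// on an interleaved 8-bit BGR buffer (3 bytes per pixel, in B, G, R order).
function redChannelInvertedMask(bgr: Uint8Array, threshold = 85): Uint8Array {
  if (bgr.length % 3 !== 0) {
    throw new RangeError("Expected 3 bytes (B, G, R) per pixel");
  }
  const out = new Uint8Array(bgr.length / 3);
  for (let i = 0; i < out.length; i++) {
    // The red value is the third byte of each pixel because of the BGR order.
    const red = bgr[i * 3 + 2] ?? 0;
    // THRESH_BINARY: values strictly above the threshold become 255, the rest 0.
    const binary = red > threshold ? 255 : 0;
    // bitwise_not on an 8-bit value is the same as 255 - value.
    out[i] = 255 - binary;
  }
  return out;
}
```

With this, a blue background pixel (low red value) ends up white and a bright text pixel ends up black, which is the dark-text-on-light-background form that Tesseract's quality guide recommends.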
Going further, looking at the cropped segment, we can see that you'll want to add "," to the character whitelist; otherwise Tesseract will probably guess a digit instead. You can just use .replace(/,/g, '') in JS to strip out the commas later.
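As a minimal sketch of that clean-up step, the following turns recognized text into numbers. The function name `parseOcrNumbers` is hypothetical, and silently dropping lines that are not purely digits after stripping commas is an assumption; you may prefer to log or reject such lines instead.

```typescript
// Hypothetical helper: convert OCR output such as "1,004,082\n18,870\n" into numbers.
function parseOcrNumbers(text: string): number[] {
  return text
    .split("\n")
    // Strip the thousands separators and any surrounding whitespace.
    .map((line) => line.replace(/,/g, "").trim())
    // Assumption: skip empty lines and anything that is not purely digits.
    .filter((line) => /^\d+$/.test(line))
    .map((line) => Number(line));
}
```

Note that `Number` is only exact up to `Number.MAX_SAFE_INTEGER`; for longer digit strings, `BigInt(line)` would be the safer conversion.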
Additionally, this is more like a block of text than a single line, so try one of the page segmentation modes (PSMs) meant for blocks of text. Alternatively, slice out each line separately and run each one as a single line of text.
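The slicing alternative could be sketched as follows. This assumes the rows are evenly spaced inside the region, which may not hold for your screenshot; the `splitIntoRows` name and the `rowCount` parameter are hypothetical. The rectangle shape matches the one already passed to `recognize` in the question.

```typescript
interface Rectangle {
  top: number;
  left: number;
  width: number;
  height: number;
}

// Hypothetical helper: cut a region into `rowCount` horizontal strips.
// Assumption: the text rows are evenly spaced within the region.
function splitIntoRows(rect: Rectangle, rowCount: number): Rectangle[] {
  if (!Number.isInteger(rowCount) || rowCount < 1) {
    throw new RangeError("rowCount must be a positive integer");
  }
  const rows: Rectangle[] = [];
  for (let i = 0; i < rowCount; i++) {
    // Round the strip boundaries so the heights always add up to rect.height.
    const start = Math.round((i * rect.height) / rowCount);
    const end = Math.round(((i + 1) * rect.height) / rowCount);
    rows.push({
      top: rect.top + start,
      left: rect.left,
      width: rect.width,
      height: end - start,
    });
  }
  return rows;
}
```

Each returned rectangle could then be passed as the `rectangle` option of a separate `recognize` call, keeping the single-line page segmentation mode from the question.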
Beyond that, the Tesseract docs have a page on improving quality: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html Erosion and dilation each take one or two lines of code with OpenCV and are covered in its documentation: https://docs.opencv.org/4.8.0/d4/d76/tutorial_js_morphological_ops.html