I'm OCRing a few sample images, and I have manually transcribed the text contained in these images into a separate text file.
I want to measure my OCR success rate, so I'm looking for an algorithm that will give me a success percentage when comparing the OCR'd text against the text I transcribed manually.
The key thing is that if there is an extra space splitting a word, I don't want to tag that as a complete failure.
For example:
Actual Text: Treadstone is a great tire
OCR'd text v1: Treadstone is a great tire (100%)
OCR'd text v2: Tread stone is a great tire (~90%)
OCR'd text v3: Tread stone tire great is a (same as v2)
OCR'd text v4: Freadstone is a freat tyre (~80%)
Is there a known algorithm that I can use for this? If not, what is an approach I should adopt for calculating this success percentage?
Consider using the Levenshtein string edit distance. You can fine-tune it by assigning different penalties to space insertion/deletion than for other characters.
You'll probably need to set a maximum allowed distance, to limit the running time on long strings.
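As a minimal sketch of the idea, here is a weighted Levenshtein distance in Python where inserting or deleting a space costs less than any other edit, converted into a success percentage. The weights (`space_cost=0.2`, `edit_cost=1.0`) are illustrative assumptions you'd tune for your data:

```python
def weighted_levenshtein(actual, ocr, space_cost=0.2, edit_cost=1.0):
    """Edit distance where space insertion/deletion is penalized lightly."""
    m, n = len(actual), len(ocr)

    def indel(ch):
        # Cheap penalty for adding/removing a space, full penalty otherwise.
        return space_cost if ch == " " else edit_cost

    # dp[i][j] = distance between actual[:i] and ocr[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + indel(actual[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + indel(ocr[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if actual[i - 1] == ocr[j - 1] else edit_cost
            dp[i][j] = min(
                dp[i - 1][j] + indel(actual[i - 1]),  # deletion
                dp[i][j - 1] + indel(ocr[j - 1]),     # insertion
                dp[i - 1][j - 1] + sub,               # substitution or match
            )
    return dp[m][n]

def success_rate(actual, ocr):
    """Success percentage: edits as a fraction of the reference length."""
    dist = weighted_levenshtein(actual, ocr)
    return max(0.0, 1.0 - dist / len(actual)) * 100.0
```

With these weights, `"Tread stone is a great tire"` scores close to 100% against the reference (one cheap space insertion), while `"Freadstone is a freat tyre"` scores noticeably lower (three full-cost substitutions). Note that this character-level measure will also penalize word reordering (your v3 case); if you want reordering treated the same as v2, you'd need a word-level alignment on top of this.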