python algorithm language-agnostic string-comparison levenshtein-distance

closest string match for comparing OCR results

I'm OCRing a few sample images. I have manually read the text contained in these images and stored it in a separate text file.

I want to test my OCR success rate, so I'm looking for an algorithm that would give me a success percentage when comparing the OCR'd text against the text I manually read and stored.

The key thing is that if the OCR output has a stray space inside a word, I don't want to count that as a complete failure.

For example:

Example 1:

Actual Text: Treadstone is a great tire 
OCR'd text v1: Treadstone is a great tire (100%)
OCR'd text v2: Tread stone is a great tire (~90%)
OCR'd text v3: Tread stone tire great is a (same as v2)
OCR'd text v4: Freadstone is a freat tyre (~80%)

Is there a known algorithm that I can use for this? If not, what is an approach I should adopt for calculating this success percentage?

Solution

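Consider using the Levenshtein string edit distance. You can fine-tune it by assigning a lower penalty to inserting or deleting a space than to edits involving other characters. To turn the distance into a percentage, divide it by the length of the longer string and subtract the result from 1.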
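As a rough illustration of the idea above, here is a minimal Python sketch of a weighted Levenshtein distance. The function names, the `space_cost=0.2` default, and the normalization by the longer string's length are all assumptions made for this example, not part of any standard definition; tune them to match how harshly you want to score each kind of error.

```python
def weighted_levenshtein(a, b, space_cost=0.2, char_cost=1.0):
    """Edit distance where inserting/deleting a space is cheaper than other edits.

    space_cost and char_cost are illustrative defaults, not standard values.
    """
    def indel(ch):
        # Cost of inserting or deleting a single character.
        return space_cost if ch == " " else char_cost

    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [0.0]
    for ch in b:
        prev.append(prev[-1] + indel(ch))

    for ca in a:
        curr = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else char_cost)
            delete = prev[j] + indel(ca)      # drop ca from a
            insert = curr[j - 1] + indel(cb)  # add cb from b
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def similarity_percent(actual, ocr, space_cost=0.2):
    """Map the weighted distance onto a 0-100 score (one possible normalization)."""
    if not actual and not ocr:
        return 100.0
    distance = weighted_levenshtein(actual, ocr, space_cost)
    longest = max(len(actual), len(ocr))
    return max(0.0, 100.0 * (1.0 - distance / longest))


actual = "Treadstone is a great tire"
for ocr in ("Treadstone is a great tire",
            "Tread stone is a great tire",
            "Freadstone is a freat tyre"):
    print(f"{similarity_percent(actual, ocr):5.1f}%  {ocr}")
```

With these particular weights, v1 scores 100%, v2 roughly 99%, and v4 roughly 88%; raising `space_cost` pulls v2 down toward the ~90% in the question. Note that a character-level edit distance does not treat reordered words (v3) as equivalent to v2, so that case would need extra handling, such as comparing sorted word lists.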

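The standard dynamic-programming computation takes time proportional to the product of the two string lengths, so on long strings you'll probably want to set a maximum allowed distance and stop as soon as it is exceeded.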
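One simple way to apply such a cutoff is sketched below, using plain unit costs to keep it short. The name `bounded_levenshtein` and the convention of returning `None` when the limit is exceeded are assumptions for this example. It relies on the fact that the final distance can never be smaller than the smallest value in any row of the table, so a row whose minimum already exceeds the limit lets us stop early.

```python
def bounded_levenshtein(a, b, max_distance):
    """Return the unit-cost edit distance, or None if it exceeds max_distance."""
    # The length difference alone is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_distance:
        return None

    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # substitute (free on a match)
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
            ))
        # Every later cell is at least as large as this row's minimum.
        if min(curr) > max_distance:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_distance else None


print(bounded_levenshtein("kitten", "sitting", 3))  # within the limit
print(bounded_levenshtein("kitten", "sitting", 2))  # limit exceeded
```

This version still fills whole rows until the cutoff triggers. A further refinement, not shown here, is to compute only the cells within `max_distance` of the diagonal, which bounds the work per row as well.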