Tags: text, statistics, nlp, ocr

What's the best way to detect garbled text in an OCR-ed document?


Are there any good NLP or statistical techniques for detecting garbled characters in OCR-ed text? Off the top of my head, I was thinking that looking at the distribution of n-grams in the text might be a good starting point, but I'm pretty new to the whole NLP domain.

Here is what I've looked at so far:

The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.


Any suggestions would be great! Thanks!


Solution

  • Yes, the most powerful tool in this case is n-grams. You should collect them from a related text corpus (one on the same topic as your OCR texts). The problem is very similar to spellchecking: if a small character change leads to a large increase in probability, the original was most likely a mistake. Check this tutorial on how to use n-grams for spellchecking. A minimal sketch of the scoring side of this idea follows.
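
For illustration, here is one way the n-gram scoring could look in Python. This is a sketch, not the answerer's implementation: the clean_corpus.txt path, the -9.0 threshold, and the function names are all illustrative assumptions. It trains character trigram counts on clean reference text and flags tokens whose average trigram log-probability is unusually low.

    import math
    from collections import Counter

    def char_ngrams(text, n=3):
        # Pad with spaces so word edges get their own n-gram context.
        padded = "  " + text.lower() + " "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def train(corpus_text, n=3):
        # Count character n-grams over a clean reference corpus.
        return Counter(char_ngrams(corpus_text, n))

    def avg_log_prob(token, counts, n=3):
        total = sum(counts.values())
        grams = char_ngrams(token, n)
        # Add-one smoothing so unseen n-grams get a small nonzero probability.
        logps = [math.log((counts[g] + 1) / (total + len(counts) + 1))
                 for g in grams]
        return sum(logps) / len(logps)

    def flag_garbled(tokens, counts, threshold=-9.0, n=3):
        # Tokens scoring below the (tunable) threshold are likely OCR garbage.
        return [t for t in tokens if avg_log_prob(t, counts, n) < threshold]

    if __name__ == "__main__":
        # clean_corpus.txt is a hypothetical file of clean, on-topic text.
        counts = train(open("clean_corpus.txt", encoding="utf-8").read())
        print(flag_garbled(["recognition", "rec0gn!t1on", "the"], counts))

In practice you would tune the threshold on held-out examples of good and garbled tokens; a word-level n-gram model over the same corpus can additionally catch errors that still look like plausible character sequences.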