Are there any good NLP or statistical techniques for detecting garbled characters in OCR'd text? Off the top of my head I was thinking that looking at the distribution of n-grams in the text might be a good starting point, but I'm pretty new to the whole NLP domain.
Here is what I've looked at so far:
The text will mostly be in English, but a general solution would be nice. The text is currently indexed in Lucene, so any ideas on a term-based approach would be useful too.
Any suggestions would be great! Thanks!
Yes, the most powerful tool in this case is n-grams. You should collect them from a related text corpus (on the same topic as your OCR'd texts). The problem is very similar to spell checking: if a small character change leads to a large increase in probability, it was probably an OCR error. Check this tutorial on how to use n-grams for spell checking.
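To make that concrete, here is a minimal sketch in Python of the character n-gram idea: a trigram model with add-one smoothing, trained on clean in-domain text, that flags tokens whose average per-gram log probability falls below a cutoff. The tiny stand-in corpus and the threshold value are assumptions for illustration; in practice you'd train on a real corpus matching your OCR domain and tune the threshold on held-out clean vs. garbled words.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Yield overlapping character n-grams, padding word boundaries with '#'."""
    padded = "#" * (n - 1) + text + "#" * (n - 1)
    for i in range(len(padded) - n + 1):
        yield padded[i:i + n]

class CharNgramModel:
    """Character n-gram model with add-one (Laplace) smoothing."""

    def __init__(self, corpus_words, n=3):
        self.n = n
        self.counts = Counter()          # full n-gram occurrence counts
        self.context_counts = Counter()  # counts of each (n-1)-char context
        for word in corpus_words:
            for gram in char_ngrams(word.lower(), n):
                self.counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        # Rough alphabet-size estimate used as the smoothing denominator.
        self.vocab = len({g[-1] for g in self.counts}) or 1

    def logprob(self, word):
        """Average per-gram log probability; low values suggest garbled text."""
        grams = list(char_ngrams(word.lower(), self.n))
        total = 0.0
        for gram in grams:
            num = self.counts[gram] + 1
            den = self.context_counts[gram[:-1]] + self.vocab
            total += math.log(num / den)
        return total / len(grams)

# Usage: train on clean in-domain text, then score suspicious OCR tokens.
corpus = "the quick brown fox jumps over the lazy dog".split()  # stand-in corpus
model = CharNgramModel(corpus, n=3)
threshold = -3.0  # hypothetical cutoff; tune on held-out clean vs. garbled data
for token in ["the", "qulck", "br0wn", "xq7z"]:
    score = model.logprob(token)
    print(token, round(score, 2), "garbled?" if score < threshold else "ok")
```

One reason to score character n-grams rather than whole words: out-of-vocabulary but legitimate tokens (names, numbers) still get reasonable scores, whereas a word-level model would flag them all as errors.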