java android statistics tesseract linguistics

How to tell if a string of characters makes intelligible words

I'm working on a simple mobile app project (mostly for fun) that uses an OCR library (Tesseract) on Android to scan a camera picture, process the recognized text, and return it to the user.

What I'm wondering is whether anyone knows of a way to programmatically (or statistically) tell if a String of characters forms actual words or is just nonsense. (I'm only targeting English at this point.)

For example, the OCR may read a picture and return:

String returned = "The quick brown fox.";

Or it might read another picture and return:

String returned = "$. _- %/ hj @;+__~";

Obviously, the first string contains real words and the second is just gibberish. I'm wondering if anyone has ideas for an easy way to differentiate between usable output and nonsense.

Solution

Compute character frequencies and a few other statistics on the returned string. I would look at the frequency and placement of whitespace, the lengths of words, and the frequency of symbols that I would and wouldn't expect to find in the kind of content I expect my users to photograph.
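As a rough illustration of those statistics, here is a minimal sketch in plain Java. The class name `TextHeuristics`, its method names, and the 0.7 / 0.5 thresholds are all hypothetical choices made for this example, not values from any library; they would need tuning against real OCR output.

```java
final class TextHeuristics {
    private TextHeuristics() {}

    /** Fraction of non-whitespace characters that are letters or digits. */
    public static double alphanumericRatio(String s) {
        int total = 0;
        int alnum = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // whitespace is ignored here; it only separates tokens
            }
            total++;
            if (Character.isLetterOrDigit(c)) {
                alnum++;
            }
        }
        return total == 0 ? 0.0 : (double) alnum / total;
    }

    /**
     * Fraction of whitespace-separated tokens that are shaped like an English
     * word: letters (plus ' or -), optionally followed by one punctuation mark.
     */
    public static double wordLikeTokenRatio(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            if (token.matches("[A-Za-z][A-Za-z'-]*[.,;:!?]?")) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    /** Assumed thresholds (0.7 and 0.5) are illustrative guesses only. */
    public static boolean looksLikeText(String s) {
        return alphanumericRatio(s) >= 0.7 && wordLikeTokenRatio(s) >= 0.5;
    }
}
```

With the two strings from the question, `looksLikeText("The quick brown fox.")` passes both checks, while `"$. _- %/ hj @;+__~"` fails them because only `hj` is alphanumeric. Note that this only checks the shape of the tokens: `hj` still counts as word-like, so telling real words from letter soup would need a dictionary lookup on top.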

If you're expecting large amounts of text, check the letter frequencies and see whether they match the known letter frequencies of English. If you're expecting receipts, look for a much higher proportion of digits than usual.
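The letter-frequency comparison could be sketched like this. The class `LetterFrequency` and its method are hypothetical names, the table holds commonly cited approximate English letter frequencies (in percent), and the L1 distance used here is just one simple choice of measure; a chi-squared statistic would be another.

```java
final class LetterFrequency {
    private LetterFrequency() {}

    // Approximate relative frequencies of a..z in English text, in percent.
    // Commonly cited figures; treat them as rough reference values.
    private static final double[] ENGLISH = {
        8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
        0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
        6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074
    };

    /**
     * Sum of absolute differences between the observed letter distribution
     * and the English reference, scaled to roughly 0 (identical) .. 2
     * (completely different). Returns NaN if the text contains no letters.
     */
    public static double distanceFromEnglish(String text) {
        int[] counts = new int[26];
        int letters = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
                letters++;
            }
        }
        if (letters == 0) {
            return Double.NaN;
        }
        double distance = 0.0;
        for (int i = 0; i < 26; i++) {
            double observedPercent = 100.0 * counts[i] / letters;
            distance += Math.abs(observedPercent - ENGLISH[i]);
        }
        return distance / 100.0;
    }
}
```

A lower distance suggests more English-like text. The measure is noisy on short strings, so it is best suited to the "large amounts of text" case; any cutoff between "text" and "nonsense" would have to be chosen empirically.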

In the end, you could let the user decide whether the result is really what they wanted. The analysis could simply show a "We don't believe this is what you want" warning that the user is free to ignore.

I used concepts like these to solve a Project Euler problem about knowing when text is properly decrypted.