I use the following function to perform offline OCR using Tess-Two, the Android fork of Tesseract OCR:
private String startOCR(Uri imgUri) {
try {
ExifInterface exif = new ExifInterface(imgUri.getPath());
int exifOrientation = exif.getAttributeInt(ExifInterface.TAG_ORIENTATION, ExifInterface.ORIENTATION_NORMAL);
int rotate = 0;
switch(exifOrientation) {
case ExifInterface.ORIENTATION_ROTATE_90:
rotate = 90;
break;
case ExifInterface.ORIENTATION_ROTATE_180:
rotate = 180;
break;
case ExifInterface.ORIENTATION_ROTATE_270:
rotate = 270;
break;
}
Log.d(TAG, "Rotation: " + rotate);
BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 4; // 1 decodes at full size; 4 decodes at 1/4 of the width and height. Values below 4 need more heap memory to hold the bitmap.
// set to 300 dpi
options.inTargetDensity = 300;
Bitmap bitmap = BitmapFactory.decodeFile(imgUri.getPath(), options);
// Change Orientation via EXIF
if (rotate != 0) {
// Getting width & height of the given image.
int w = bitmap.getWidth();
int h = bitmap.getHeight();
// Setting pre rotate
Matrix mtx = new Matrix();
mtx.preRotate(rotate);
// Rotating Bitmap
bitmap = Bitmap.createBitmap(bitmap, 0, 0, w, h, mtx, false);
}
// To Grayscale
bitmap = toGrayscale(bitmap);
final Bitmap b = bitmap;
final ImageView ivResult = (ImageView)findViewById(R.id.ivResult);
if(ivResult != null) {
runOnUiThread(new Runnable() {
@Override
public void run() {
ivResult.setImageBitmap(b);
}
});
}
return extractText(bitmap);
} catch (Exception e) {
Log.e(TAG, e.getMessage());
return "";
}
}
And here is the extractText() method:
private String extractText(Bitmap bitmap) {
//Log.d(TAG, "extractText");
try {
tessBaseApi = new TessBaseAPI();
} catch (Exception e) {
Log.e(TAG, e.getMessage());
if (tessBaseApi == null) {
Log.e(TAG, "TessBaseAPI is null. TessFactory not returning tess object.");
}
}
tessBaseApi.init(DATA_PATH, lang);
//EXTRA SETTINGS
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");
Log.d(TAG, "Training file loaded");
tessBaseApi.setDebug(true);
tessBaseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_AUTO_OSD);
tessBaseApi.setImage(bitmap);
String extractedText = "empty result";
try {
extractedText = tessBaseApi.getUTF8Text();
} catch (Exception e) {
Log.e(TAG, "Error in recognizing text.");
}
tessBaseApi.end();
return extractedText;
}
The value returned by extractText() is shown in the following screenshot:
Accuracy is very low, even though I convert the image to grayscale and upscale it to 300 dpi before performing OCR. How can I improve the results? Is the trained data not good enough?
I ran some tests and have a few observations and conclusions that could improve your results.
See my results for this input:
a) Lowercase only:
Parameter:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "abcdefghijklmnopqrstuvwxyz1234567890',.?;/ ");
Result:
05 atenienses nnito, hdeleto e laicao, os principais acusadores de gocrates, nao defendiam apenas que o filosofo corrompia a juventude; eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a liornero. nristofanes, um dos responsaveis, segundo socrates, dos preconceitos contra o filosofo, era outro grande defensor dessa virtude.
socrates, de certa forma, estava em guerra com a tradieao poetica grega. 0 metodo de socrates era o oposto a narrativa epica de tlornero. sua dialetica nao tinha nada de semideuses corn superpoderes 6
b) Uppercase and Lowercase letters:
Parameter:
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ1234567890',.?;/ ");
Result:
Os atenienses Anito, Meleto e Licao, os principais acusadores de Socrates, nao defendiam apenas que o filosofo corrompia a juventude; eles lutavam tama bern pelas virtudes da tradigao poetica vinculada a Homero. Aristofanes, um dos responsaveis, segundo socrates, dos preconceitos contra o filosofo, era outro grande defensor dessa virtude.
socrates, de certa forma, estava em guerra com a tradieao poetica grega. O metodo de socrates era o Oposto a narrativa epica de Homero. Sua dialetica nao tinha nada de semideuses corn superpoderes 6
PS: I ran this example with the Portuguese language. Note that words needing other characters, such as 'é', 'ó' and 'ç', were not recognized correctly, because those characters were not included in the whitelist.
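To illustrate the point about missing characters, here is a minimal sketch of how a whitelist that also covers accented letters could be assembled. The class name OcrWhitelist and the method build are hypothetical helpers (not part of Tess-Two), and the three extra characters are just the ones mentioned above, written as Unicode escapes; a real Portuguese whitelist would probably need more.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that assembles a string for VAR_CHAR_WHITELIST. */
final class OcrWhitelist {

    // Same base characters as the whitelist used in test (b).
    static final String LOWERCASE = "abcdefghijklmnopqrstuvwxyz";
    static final String DIGITS = "1234567890";
    static final String PUNCTUATION = "',.?;/ ";

    /**
     * Returns lowercase and uppercase letters, digits, punctuation and the
     * given extra characters, with duplicates removed (first occurrence wins).
     */
    static String build(String extraChars) {
        String all = LOWERCASE + LOWERCASE.toUpperCase(Locale.ROOT)
                + DIGITS + PUNCTUATION + extraChars;
        Set<Character> seen = new LinkedHashSet<>();
        for (char c : all.toCharArray()) {
            seen.add(c);
        }
        StringBuilder sb = new StringBuilder(seen.size());
        for (char c : seen) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-acute, o-acute, c-cedilla (illustrative only, not an exhaustive list).
        String whitelist = build("\u00e9\u00f3\u00e7");
        System.out.println(whitelist);
        // On Android, the string would then be passed to Tess-Two like this:
        // baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
    }
}
```

The whitelist only restricts which characters may appear in the output; I assume the loaded trained data must also cover those characters for them to be recognized.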
I also tried running it on your picture; the result improved, though not by much:
Font 20; Which polrlrcran has caplured Ihe curve, summed up a growing mood. In a Ierocrous speech? 'Your iron industry is dead. dead as munon. Your coal yum mono greatly on the iron Vbur Ilk Mary is and. o Your woolen induslry is Why. Your canon Mr Wilding induslry. blmailf
So I checked how Tesseract binarized the image:
Your image has a lot of noise, so when the API binarizes it, a large part of the picture becomes illegible. I suggest running it again without converting to grayscale, and researching how to reduce the noise in your image.
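As one possible starting point for noise reduction, here is a minimal sketch of a 3x3 median filter, a common way to remove isolated speckles before binarization. This is my own choice of technique, not something Tess-Two does for you. The class MedianFilter is hypothetical, and it works on a plain int[][] of gray values (0 to 255) so that it runs outside Android; on a device you would first have to copy the pixel values out of the Bitmap.

```java
import java.util.Arrays;

/** Hypothetical helper: 3x3 median filter on a grayscale image. */
final class MedianFilter {

    /**
     * Returns a filtered copy of gray (rows of values in 0..255).
     * Border pixels are copied unchanged; the input is not modified.
     */
    static int[][] apply(int[][] gray) {
        int h = gray.length;
        int w = (h == 0) ? 0 : gray[0].length;
        int[][] out = new int[h][];
        for (int y = 0; y < h; y++) {
            out[y] = gray[y].clone();
        }
        int[] window = new int[9];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                // Collect the 3x3 neighbourhood around (x, y).
                int k = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        window[k++] = gray[y + dy][x + dx];
                    }
                }
                Arrays.sort(window);
                out[y][x] = window[4]; // middle of 9 sorted values
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A light 5x5 patch with one dark speckle in the middle.
        int[][] img = new int[5][5];
        for (int[] row : img) {
            Arrays.fill(row, 200);
        }
        img[2][2] = 0;
        int[][] cleaned = apply(img);
        System.out.println(cleaned[2][2]); // prints 200: the speckle is gone
    }
}
```

A median filter keeps edges sharper than a simple blur, but it can also erase very thin strokes, so it is worth comparing the thresholded image with and without it.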
To help with debugging, you can save the thresholded image:
Bitmap thresholded = WriteFile.writeBitmap(baseApi.getThresholdedImage()); // converts the thresholded Pix to a Bitmap you can display or save
I hope this is useful to you! Thank you for sharing your issue!
Cheers!