How can I restrict the results of tess-two (the Tesseract and Leptonica library)?
I want Tesseract to limit its results.
For example:
If the recognition result is "asn*&bhDK 1234 UDaks&%^jdg", I only want to keep "DK1234UD".
So lowercase letters, newlines, and spaces should be dropped; only uppercase letters and digits should be kept.
I am using Java.
This is the recognition code:
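To make the target concrete, the filtering described above can be sketched in plain Java, independent of tess-two. The names `OcrFilter` and `keepUpperAndDigits` are hypothetical and exist only for this illustration; the sketch post-processes a string and does not change what Tesseract itself recognizes.

```java
final class OcrFilter {
    /**
     * Removes every character that is not an uppercase ASCII letter or a digit.
     * Spaces, newlines, lowercase letters, and symbols are all dropped.
     */
    static String keepUpperAndDigits(String text) {
        return text.replaceAll("[^A-Z0-9]", "");
    }
}
```

Calling `OcrFilter.keepUpperAndDigits("asn*&bhDK 1234 UDaks&%^jdg")` gives "DK1234UD", which is the result I am after.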
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.setDebug(true);
//OEM_TESSERACT_CUBE_COMBINED is an OCR engine mode, not a page segmentation mode, so it is passed to init()
baseApi.init(DATA_PATH, lang, TessBaseAPI.OEM_TESSERACT_CUBE_COMBINED);
//only the last setPageSegMode call takes effect, so the earlier PSM_AUTO_OSD call is dropped
baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
//setImage
baseApi.setImage(bmpOtsu);
//set whitelist
String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
//variable for recognizing
String recognizedText = baseApi.getUTF8Text();
String resultTxt = recognizedText;
baseApi.end();
if (lang.equalsIgnoreCase("eng")) {
    recognizedText = recognizedText.replaceAll("[^A-Z0-9]", " ");
}
Can somebody tell me how I can do that? What should be added here?
Thanks to @Yazan for the answer; it works.
I have also improved on the answer.
This is my code:
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.setDebug(true);
//OEM_TESSERACT_CUBE_COMBINED is an OCR engine mode, not a page segmentation mode, so it is passed to init()
baseApi.init(DATA_PATH, lang, TessBaseAPI.OEM_TESSERACT_CUBE_COMBINED);
//only the last setPageSegMode call takes effect, so the earlier PSM_AUTO_OSD call is dropped
baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
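//set variables
String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
//regex used further down to strip whitespace from the result
String blackList = "\\s";
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
//note: Tesseract reads the blacklist as literal characters, not as a regex
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);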
//setImage
//baseApi.setImage(bmpOtsu, w, h, 8, (Integer) null);
baseApi.setImage(bmpOtsu);
//variable for recognizing
String recognizedText = baseApi.getUTF8Text();
recognizedText = recognizedText.replaceAll(blackList, "");//remove space
String resultTxt = recognizedText;
//
baseApi.end();
Log.v(TAG, "OCRED TEXT: " + recognizedText);
if (lang.equalsIgnoreCase("eng")) {
    //position of the first "D"; indexOf returns -1 when there is none
    int get8digits = recognizedText.indexOf("D");
    if (get8digits >= 0) {
        String loop = recognizedText.substring(get8digits);
        if (loop.length() >= 8) {
            Log.w(TAG, "OPTION 1" + "\n" + "Length: " + loop.length() + "\n" + "Values: " + loop);
            //keep the 8 characters starting at the first "D"
            recognizedText = loop.substring(0, 8);
        } else {
            Log.w(TAG, "OPTION 2" + "\n" + "Length: " + loop.length() + "\n" + "Values: " + loop);
            //fewer than 8 characters remain, keep them all
            recognizedText = loop;
        }
    } else {
        Log.w(TAG, "OPTION 3" + "\n" + "Length: " + recognizedText.length() + "\n" + "Values: " + recognizedText);
        //no "D" found: every uppercase letter and digit is replaced with a space
        recognizedText = recognizedText.replaceAll("[A-Z0-9]", " ");
    }
}
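The post-processing step can also be pulled out into a plain-Java helper so it can be checked without a device or tess-two. This is only a sketch: `PlateExtractor` and `extract` are hypothetical names, the 8-character length and the "starts at the first D" rule are taken from my code above, and returning an empty string when there is no "D" is an assumption that differs from the blanking with spaces done there.

```java
final class PlateExtractor {
    // Length of the code to keep, as used in the snippet above.
    private static final int CODE_LENGTH = 8;

    /**
     * Returns up to CODE_LENGTH characters starting at the first 'D',
     * or an empty string when the cleaned text contains no 'D'.
     */
    static String extract(String ocrText) {
        // Drop everything except uppercase letters and digits (this also removes whitespace).
        String cleaned = ocrText.replaceAll("[^A-Z0-9]", "");
        int start = cleaned.indexOf('D');
        if (start < 0) {
            return ""; // assumption: no 'D' means there is no usable code
        }
        // Clamp the end index so short inputs do not throw.
        int end = Math.min(start + CODE_LENGTH, cleaned.length());
        return cleaned.substring(start, end);
    }
}
```

Checking `indexOf` for -1 before calling `substring` avoids a `StringIndexOutOfBoundsException` when the recognized text has no "D" at all.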
I hope this helps someone.