java android ocr tesseract leptonica

How to limit the results of recognition?


How can I restrict the results of tess-two (the Tesseract and Leptonica library)?
I want Tesseract to limit the results as follows:

  1. Take only 8 characters, counted from the letter D
  2. Exclude lowercase letters, line breaks, spaces, and symbols
  3. Keep only uppercase letters and numbers

For example:
If the recognition result is "asn*&bhDK 1234 UDaks&%^jdg", the output should be just "DK1234UD".
So lowercase letters, line breaks, spaces, and symbols are dropped; only uppercase letters and numbers are kept.
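
To make the character rules concrete, here is a minimal sketch of the filtering step in plain Java, independent of tess-two. The class and method names (`UppercaseDigitFilter`, `keepUppercaseAndDigits`) are made up for illustration. It only removes unwanted characters and does not handle the 8-character limit counted from D.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of tess-two.
final class UppercaseDigitFilter {
    // Matches every character that is NOT an uppercase ASCII letter or a digit.
    private static final Pattern UNWANTED = Pattern.compile("[^A-Z0-9]");

    // Removes lowercase letters, whitespace, line breaks, and symbols.
    static String keepUppercaseAndDigits(String raw) {
        return UNWANTED.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        // prints DK1234UD
        System.out.println(keepUppercaseAndDigits("asn*&bhDK 1234 UDaks&%^jdg"));
    }
}
```

Replacing with an empty string, rather than with a space, is what keeps the remaining characters adjacent.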

I am using Java.

This is the recognition code:

    TessBaseAPI baseApi = new TessBaseAPI();
    baseApi.setDebug(true);
    baseApi.init(DATA_PATH, lang);
    // Only the last setPageSegMode call takes effect, and it should come after init().
    // OEM_TESSERACT_CUBE_COMBINED is an OCR engine mode, not a page segmentation mode.
    baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
    //setImage
    baseApi.setImage(bmpOtsu);
    //set whitelist
    String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
    //variable for recognizing      
    String recognizedText = baseApi.getUTF8Text();
    String resultTxt = recognizedText;
    baseApi.end();

    if ( lang.equalsIgnoreCase("eng") ) {
        recognizedText = recognizedText.replaceAll("[^A-Z0-9]", " ");
    }

Can somebody tell me how I can do that? What should be added here?


Solution

  • Thanks to @Yazan for the answer; it works.
    I have improved on it.
    This is my code:

        TessBaseAPI baseApi = new TessBaseAPI();
        baseApi.setDebug(true);
        baseApi.init(DATA_PATH, lang);
        // Set the page segmentation mode after init(); only the last call takes effect.
        // OEM_TESSERACT_CUBE_COMBINED is an engine mode, not a page segmentation mode,
        // so that call and the overridden PSM_AUTO_OSD call were dropped.
        baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
        // Restrict recognition to uppercase letters and digits
        String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
        // The char blacklist is a literal list of characters, not a regex,
        // so whitespace is stripped from the result below instead.
        String whitespace = "\\s";
        // setImage
        baseApi.setImage(bmpOtsu);
        // Recognize, then remove spaces and line breaks
        String recognizedText = baseApi.getUTF8Text();
        recognizedText = recognizedText.replaceAll(whitespace, "");
        String resultTxt = recognizedText;
        baseApi.end();

        Log.v(TAG, "OCRED TEXT: " + recognizedText);
        if (lang.equalsIgnoreCase("eng")) {
            int get8digits = recognizedText.indexOf("D");
            // indexOf returns -1 when there is no "D", so guard the substring call
            String loop = get8digits >= 0 ? recognizedText.substring(get8digits) : "";
            if (get8digits >= 0 && loop.length() >= 8) {
                // At least 8 characters from the first "D": keep exactly 8
                Log.w(TAG, "OPTION 1\nLength: " + loop.length() + "\nValues: " + loop);
                recognizedText = loop.substring(0, 8);
            } else if (get8digits >= 0) {
                // Fewer than 8 characters from the first "D": keep what is there
                Log.w(TAG, "OPTION 2\nLength: " + loop.length() + "\nValues: " + loop);
                recognizedText = loop;
            } else {
                // No "D" found: blank out every recognized character
                Log.w(TAG, "OPTION 3\nLength: " + recognizedText.length() + "\nValues: " + recognizedText);
                recognizedText = recognizedText.replaceAll("[A-Z0-9]", " ");
            }
        }

    I hope this helps someone.
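
The extraction step in the code above can also be tried outside Android as a small pure-Java helper. The following is a sketch under assumptions, not part of tess-two: `CodeExtractor`, `extractCode`, and `CODE_LENGTH` are hypothetical names, and the input is assumed to be the raw string returned by `getUTF8Text()`. Unlike the code above, it returns an empty `Optional` when no "D" is found instead of blanking the text.

```java
import java.util.Optional;

// Hypothetical helper for post-processing OCR output; names are illustrative.
final class CodeExtractor {
    private static final int CODE_LENGTH = 8;

    // Returns up to CODE_LENGTH characters starting at the first 'D',
    // after dropping everything except uppercase letters and digits.
    static Optional<String> extractCode(String ocrText) {
        String cleaned = ocrText.replaceAll("[^A-Z0-9]", "");
        int start = cleaned.indexOf('D');
        if (start < 0) {
            return Optional.empty(); // no 'D' in the recognized text
        }
        // Clamp the end index so short results do not throw
        int end = Math.min(start + CODE_LENGTH, cleaned.length());
        return Optional.of(cleaned.substring(start, end));
    }

    public static void main(String[] args) {
        System.out.println(extractCode("asn*&bhDK 1234 UDaks&%^jdg").orElse("<none>"));
        System.out.println(extractCode("XXD12").orElse("<none>"));
        System.out.println(extractCode("abc 123").orElse("<none>"));
    }
}
```

Checking the index before calling `substring` matters because `indexOf` returns -1 when the character is absent, and `substring(-1)` throws a `StringIndexOutOfBoundsException`.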